© 2010 IBM Corporation
The PERCS High-Performance Interconnect
Hot Interconnects 18
Baba Arimilli: Chief Architect
Ram Rajamony, Scott Clark
Outline
HPCS Program Background and Goals
PERCS Topology
POWER7 Hub Chip
– Overview
– HFI and Packet Flows
– ISR and Routing
– CAU
– Chip and Module Metrics
Summary
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.
“This design represents a tremendous increase in the use of optics in systems, and a disruptive transition from datacom- to computercom-style optical interconnect technologies.”
HPCS Background and Goals
DARPA’s “High Productivity Computing Systems” Program
Impact:
– Performance (time-to-solution): speedup by 10X to 40X
– Programmability (idea-to-first-solution): dramatically reduce cost & development time
– Portability (transparency): insulate software from the system
– Robustness (reliability): continue operating in the presence of localized hardware failure, contain the impact of software defects, & minimize the likelihood of operator error
Applications:
– Ocean/Wave Forecasting
– Weather Prediction
– Climate Modeling
– Ship Design
– Nuclear Stockpile Stewardship
– Weapons Integration
Goal: Provide a new generation of economically viable high productivity computing systems for the national security and industrial user community
PERCS – Productive, Easy-to-use, Reliable Computing System is IBM’s response to DARPA’s HPCS Program
What The HPC Community Told Us They Needed
Maximum Core Performance
– … to minimize the number of cores needed for a given level of performance, as well as lessen the impact of sections of code with limited scalability

Low Latency, High Bandwidth Communications Fabric
– … to maximize scalability of science and engineering applications

Large, Low Latency, High Bandwidth Memory Subsystem
– … to enable the solution of memory-intensive problems

Large Capacity, High Bandwidth I/O Subsystem
– … to enable the solution of data-intensive problems

Reliable Operation
– … to enable long-running simulations
Design Goals
High bisection bandwidth
Low packet latency
High interconnect bandwidth (even for packets < 50 bytes)
Design Goals
High bisection bandwidth
– Small fanout from the hub chip necessitates mesh/toroidal topologies
– Use very high fanout from the hub chip to increase interconnect “reach”

Low packet latency
– A large number of stages adds delay at every stage
– Use a topology with a very small number of hops

High interconnect bandwidth (even for packets < 50 bytes)
– Architect the switch pipeline to handle small packets
– Automatically (in hardware) aggregate and disaggregate small packets
End Result: The PERCS System Rack

(Rack photo with labeled components:)
– Water Conditioning Units: accepts standard building chilled water
– Bulk Power Regulators: universal power input
– Storage Enclosure
– POWER7 QCM
– Hub Chip Module
PERCS Topology
Topology
Numerous topologies evaluated

Converged on a multi-level direct-connect topology to address the design goals of high bisection bandwidth and low latency
– Multiple levels
– All elements in each level of the hierarchy are fully connected with each other
Design Trade-offs
Narrowed options to 2-level and 3-level direct-connect topologies
Link bandwidth ratios determined by the number of levels
– Let the levels be denoted Z, L, D
– 3 levels: longest direct path is Z-L-Z-D-Z-L-Z → need Z:L:D bandwidth ratios of 4:2:1
– 2 levels: longest direct path is L-D-L → need an L:D bandwidth ratio of 2:1

Maximum system size determined by the number of links of each type
– 3 levels: maximum system size ≈ Z⁴·L²·D¹ · (number of POWER7s per hub chip)
– 2 levels: maximum system size ≈ L²·D¹ · (number of POWER7s per hub chip)
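For a rough sense of scale, this sizing can be written down directly. The sketch below is illustrative only: it assumes a 2-level topology, one D-link per supernode pair, and PERCS-like link counts (7 LL + 24 LR L-links, 16 D-links, 4 POWER7 chips per hub); the function name and structure are not from the slides.

```python
# Illustrative 2-level sizing sketch (assumes one D-link per supernode pair).

def max_system_size_2level(l_links, d_links, p7_per_hub):
    hubs_per_supernode = l_links + 1                  # hubs fully connected via L-links
    d_links_per_supernode = hubs_per_supernode * d_links
    max_supernodes = d_links_per_supernode + 1        # one D-link to every other supernode
    return max_supernodes * hubs_per_supernode * p7_per_hub

# 7 LL + 24 LR = 31 L-links, 16 D-links, 4 POWER7 chips per hub
print(max_system_size_2level(31, 16, 4))              # -> 65664 POWER7 chips (~ L^2 * D * P)
```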
Design Trade-offs within the Topology

Aggregate bi-directional link bandwidths (in GB/s)
– 3 levels, 1 POWER7/Hub chip: 4-2-8: D = 40, L = 80, Z = 160
– 3 levels, 2 POWER7s/Hub chip: 2-2-8: D = 80, L = 160, Z = 320
– 3 levels, 4 POWER7s/Hub chip: 4-1-8: D = 160, L = 320, Z = 640
– 2 levels, 1 POWER7/Hub chip: 64-32: D = 40, L = 80
– 2 levels, 2 POWER7s/Hub chip: 32-32: D = 80, L = 160
– 2 levels, 4 POWER7s/Hub chip: 16-32: D = 160, L = 320
– 2 levels, 4 POWER7s/Hub chip: 64-16: D = 160, L = 320
– 2 levels, 4 POWER7s/Hub chip: 4-64: D = 160, L = 320
– …

Too many Hub chips → high cost
Too many links → low point-to-point bandwidth, high cost & power
Too much bandwidth per link → high power

– 2 levels, 4 POWER7s/Hub chip: 16-32: D = 160, L = 320 ← design point
PERCS POWER7 Hierarchical Structure
POWER7 Chip
– 8 cores

POWER7 QCM & Hub Chips
– QCM: 4 POWER7 chips
  • 32-core SMP image
– Hub Chip: one per QCM
  • Interconnects QCMs, Nodes, and Super Nodes
– Hub Module: Hub chip with optics

POWER7 HPC Node
– 2U node
– 8 QCMs, 8 Hub Chip Modules
  • 256 cores

POWER7 ‘Super Node’
– Multiple Nodes per ‘Super Node’
– Basic building block

Full System
– Multiple ‘Super Nodes’
Logical View of PERCS Interconnect
(Figure: each hub chip contains an ISR (Integrated Switch/Router), two HFIs (Host Fabric Interfaces), and a CAU (Collective Acceleration Unit), and attaches to a quad POWER7 module over the POWER7 coherency bus. Eight such octants form a drawer, and multiple drawers form a supernode. Each hub has 7 “local” L-links to the other hubs within its drawer and 24 “remote” L-links to hubs in the other drawers, giving full direct connectivity between quad modules via L-links; 16 D-links per hub provide full direct connectivity between supernodes.)
System-Level D-Link Cabling Topology
Number of SuperNodes in system dictates D-link connection topology
– >256 SN Topology: 1 D-link interconnecting each SN-SN pair
– 256-SN Topology: 2 D-Links interconnecting each SN-SN pair
– 128-SN Topology: 4 D-links interconnecting each SN-SN pair
– 64-SN Topology: 8 D-links interconnecting each SN-SN pair
– 32-SN Topology: 16 D-links interconnecting each SN-SN pair
– 16-SN Topology: 32 D-links interconnecting each SN-SN pair
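The pattern above follows from spreading a fixed pool of D-links per supernode across all supernode pairs. The sketch below assumes 32 hubs per supernode with 16 D-links each (512 outgoing D-links per supernode); it is only an approximation of the cabling rules, not the actual placement algorithm.

```python
# Approximation: D-links available per SN-SN pair at different system sizes,
# assuming 32 hubs/supernode x 16 D-links/hub = 512 outgoing D-links.
D_LINKS_PER_SUPERNODE = 32 * 16   # 512

for num_supernodes in (16, 32, 64, 128, 256, 512):
    per_pair = max(1, D_LINKS_PER_SUPERNODE // num_supernodes)
    print(f"{num_supernodes:3d} supernodes -> {per_pair:2d} D-link(s) per SN-SN pair")
```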
POWER7 Hub Chip
POWER7 Hub Chip Overview
Extends POWER7 capability for high performance cluster optimized systems
Replaces external switching and routing functions in prior networks
A low-diameter, two-tier direct-graph network topology interconnects tens of thousands of POWER7 8-core processor chips, dramatically improving bisection bandwidth and latency
Highly Integrated Design
– Integrated Switch/Router (ISR)
– Integrated HCA (HFI)
– Integrated MMU (NMMU)
– Integrated PCIe channel controllers
– Distributed function across the POWER7 and Hub chipset
– Chip and optical interconnect on module
  • Enables maximum packaging density

Hardware Acceleration of key functions
– Collective Acceleration
  • No CPU overhead at each intermediate stage of the spanning tree
– Global Shared Memory
  • No CPU overhead for remote atomic updates
  • No CPU overhead at each intermediate stage for small packet disaggregation/aggregation
– Virtual RDMA
  • No CPU overhead for address translation
POWER7 Hub Chip Block Diagram
(Block diagram: the POWER7 coherency bus ties the two HFIs, the NMMU, the CAU, and the POWER7 link controllers to the ISR, which drives the LL, LR, and D link controllers and three PCIe 2.1 controllers (x16, x16, x8).)

Off-chip interfaces and aggregate bandwidths:
– 4 POWER7 links – 3 Gb/s x 8B (POWER7 QCM connect): 192 GB/s
– 7 L-Local links – 3 Gb/s x 8B (within the drawer): 336 GB/s
– 24 L-Remote links – 10 Gb/s x 6b (intra-supernode connect): 240 GB/s
– 16 D links – 10 Gb/s x 12b (inter-supernode connect): 320 GB/s
– 3 PCIe – 5 Gb/s x 40b (PCIe connect): 40 GB/s

1.128 TB/s of off-chip interconnect bandwidth
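As a sanity check, the per-class totals and the 1.128 TB/s figure can be reproduced from per-direction link rates. The rates used below (24 GB/s for the 8B/3 Gb/s copper buses, 5 GB/s for LR, 10 GB/s for D, 0.5 GB/s per PCIe Gen2 lane) are derived from the lane counts and encodings quoted in this deck, so treat this as illustrative arithmetic rather than a specification.

```python
# Sketch: aggregate (bidirectional) off-chip bandwidth from assumed
# per-direction data rates per link.
link_classes = [
    # name,                     link count, GB/s per direction per link
    ("POWER7 QCM (W,X,Y,Z)",             4, 24.0),   # 8B wide @ 3 Gb/s
    ("L-Local (LL0-LL6)",                7, 24.0),   # 8B wide @ 3 Gb/s
    ("L-Remote (LR0-LR23)",             24,  5.0),   # optical, 10 Gb/s lanes, 8b/10b
    ("D (D0-D15)",                      16, 10.0),   # optical, 10 Gb/s lanes, 8b/10b
    ("PCIe 2.1 (x16 + x16 + x8)",       40,  0.5),   # per lane, 5 GT/s, 8b/10b
]
total = 0.0
for name, count, per_dir in link_classes:
    bidir = 2 * count * per_dir
    total += bidir
    print(f"{name:28s} {bidir:6.0f} GB/s")
print(f"{'Total off-chip':28s} {total:6.0f} GB/s")   # -> 1128 GB/s = 1.128 TB/s
```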
Host Fabric Interface (HFI) and Packet Flows
Host Fabric Interface (HFI) Features
Non-coherent interface between the POWER7 QCM and the ISR
– Four ramps/ports from each HFI to the ISR
Communication controlled through “windows”
– Multiple windows supported per HFI

Address translation provided by NMMU
– HFI provides EA, LPID, Key, Protection Domain
– Multiple page sizes supported
POWER7 Cache-based sourcing to HFI, injection from HFI
– HFI can extract produced data directly from processor cache
– HFI can inject incoming data directly into processor L3 cache
Host Fabric Interface (HFI) Features (cont’d)
Supports three APIs
– Message Passing Interface (MPI)
– Global Shared Memory (GSM)
  • Support for active messaging in HFI (and POWER7 Memory Controller)
– Internet Protocol (IP)

Supports five primary packet formats
– Immediate Send
  • ICSWX instruction for low latency
– FIFO Send/Receive
  • One to sixteen cache lines moved from local send FIFO to remote receive FIFO
– IP
  • IP to/from FIFO
  • IP with scatter/gather descriptors
– GSM/RDMA
  • Hardware and software reliability modes
– Collective: Reduce, Multicast, Acknowledge, Retransmit
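A messaging layer on top of the HFI has to pick one of these formats per transfer. The sketch below is a hypothetical selection policy based only on the size and type hints listed above; it is not IBM's software stack, and the thresholds are the single-cache-line and 16-cache-line limits named on this slide.

```python
# Hypothetical selection of an HFI packet format by transfer size/type.
CACHE_LINE = 128                       # Immediate Send carries a single cache line
FIFO_PACKET_MAX = 16 * CACHE_LINE      # FIFO packets: 1 to 16 cache lines (2 KB)

def choose_packet_format(nbytes, atomic=False, ip=False):
    if ip:
        return "IP"                        # IP to/from FIFO, or scatter/gather
    if atomic:
        return "Small-RDMA / Remote Atomic Update"
    if nbytes <= CACHE_LINE:
        return "Immediate Send"            # latency-optimized ICSWX push
    if nbytes <= FIFO_PACKET_MAX:
        return "FIFO Send/Receive"         # bandwidth-optimized FIFO path
    return "GSM/RDMA"                      # large, multi-packet messages

print(choose_packet_format(64), choose_packet_format(1500), choose_packet_format(1 << 20))
```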
Host Fabric Interface (HFI) Features (cont’d)
GSM/RDMA Packet Formats
– Full RDMA (memory to memory)
  • Write, Read, Fence, Completion
  • Large message sizes with multiple packets per message
– Half-RDMA (memory to/from receive/send FIFO)
  • Write, Read, Completion
  • Single packet per message
– Small-RDMA (FIFO to memory)
  • Atomic updates
  • ADD, AND, OR, XOR, and Compare & Swap, with and without data fetch
– Remote Atomic Update (FIFO to memory)
  • Multiple independent remote atomic updates
  • ADD, AND, OR, XOR
  • Hardware-guaranteed reliability mode
HFI Window Structure

(Figure: window context detail. The HFI implements windows 0…n. A window’s context points into real (physical) memory at user, OS, and hypervisor levels: the window’s send FIFO, its receive FIFO, the segment table for the task using the window, and the page table for the partition using the window. Per-window context fields include the HFI command count, send FIFO address, receive FIFO address, epoch vector address, SDR1/page table pointer, job key, process ID, and LPAR ID.)
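A software-visible model of one window context might look like the sketch below. The field names come from the figure; the types, widths, and grouping are assumptions made for illustration and are not the hardware register layout.

```python
# Illustrative model of an HFI window context (not the hardware format).
from dataclasses import dataclass

@dataclass
class HfiWindowContext:
    window_id: int
    hfi_command_count: int   # doorbell/command bookkeeping
    send_fifo_addr: int      # real address of this window's send FIFO
    recv_fifo_addr: int      # real address of this window's receive FIFO
    epoch_vector_addr: int   # real address of the epoch vector
    sdr1: int                # page table pointer for the owning partition
    job_key: int             # key protecting the windows of one parallel job
    process_id: int          # task using this window
    lpar_id: int             # logical partition using this window
```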
End-to-End Packet Flow

(Figure: a packet originates in a POWER7 chip’s processors/caches or memory, crosses the POWER7 coherency bus to an HFI in the local hub chip, is injected into the local ISR, traverses the ISR network (possibly through intermediate ISRs), and is delivered by the destination hub’s ISR and HFI over that node’s POWER7 coherency bus into its caches or memory.)
HFI FIFO Send/Receive Packet Flow

Source node:
– Check space in the send FIFO
– Calculate the flit count
– Get send FIFO slots
– Build the base header and message header
– Copy data to the send FIFO (cache/memory)
– Ring the HFI doorbell
– HFI dma_reads the packet from the send FIFO and dma_writes it into the ISR network

Destination node:
– HFI dma_writes the packet into the receive FIFO (cache/memory)
– When the packet is valid, copy it from the receive FIFO to the user’s buffer
– User callback to consume the data

Single or multiple packets per doorbell
Packets processed in FIFO order
Packet sizes up to 2 KB
Bandwidth optimized
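The send-side steps map naturally onto a few lines of host code. The sketch below is purely illustrative (the real interface is memory-mapped FIFO slots and a doorbell register, not these Python objects), but it mirrors the space check, flit count, slot reservation, and doorbell sequence listed above.

```python
# Illustrative model of the FIFO send-side sequence (not IBM's actual API).
CACHE_LINE = 128                      # bytes; one FLIT
MAX_FIFO_PACKET = 16 * CACHE_LINE     # up to 16 cache lines (2 KB) per packet

def flit_count(header_bytes, payload_bytes):
    total = header_bytes + payload_bytes
    assert total <= MAX_FIFO_PACKET
    return -(-total // CACHE_LINE)    # ceiling division

class SendFifo:
    def __init__(self, slots):
        self.free_slots = slots
    def reserve(self, flits):         # "check space" + "get send FIFO slots"
        if flits > self.free_slots:
            raise BufferError("send FIFO full")
        self.free_slots -= flits

fifo = SendFifo(slots=512)
fifo.reserve(flit_count(header_bytes=64, payload_bytes=1024))
# ... build the base/message headers, copy data into the reserved slots,
# then ring the HFI doorbell (an MMIO write) so the HFI DMAs the packet out.
```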
HFI Immediate Send/Receive Packet Flow

Source node:
– Check for available HFI buffers
– Build the base header and message header in cache
– Execute the ICSWX instruction (pushes the data from cache into the HFI)
– HFI dma_writes the packet into the ISR network

Destination node:
– HFI dma_writes the packet into the receive FIFO (cache/memory)
– When the packet is valid, copy it from the receive FIFO to the user’s buffer
– User callback to consume the data

Single cache line packet size
Latency optimized
HFI GSM/RDMA Message Flow (RDMA Write)

Source node:
– Check space in the RDMA command (CMD) FIFO
– Build the RDMA base header in the RDMA CMD FIFO (cache/memory)
– Ring the HFI doorbell
– HFI dma_reads the RDMA header from the CMD FIFO, dma_reads the RDMA payload from the task’s EA space in memory, and dma_writes the packets into the ISR network

Destination node:
– HFI writes the payload into the destination task’s EA space in memory
– Initiator and/or remote completion notifications are generated to the receive FIFOs for the last packet
– User callback to consume the notification

Single or multiple RDMA messages per doorbell
Send-side HFI breaks a large message into multiple packets of up to 2 KB each
Large RDMA messages are interleaved with smaller RDMA messages to prevent head-of-line blocking in the RDMA CMD FIFO
Initiator notification packet traverses the network in the opposite direction from the data (not shown)
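The send-side packetization can be pictured as below. Fragmentation is actually done by the HFI hardware, so this Python generator is only a model of the 2 KB-per-packet rule, with hypothetical arguments.

```python
# Model of send-side fragmentation of one RDMA write into <= 2 KB packets.
MAX_RDMA_PACKET = 2048   # bytes of payload per wire packet

def fragment_rdma_write(src_ea, dst_ea, length):
    """Yield (source EA, destination EA, byte count), one tuple per packet."""
    offset = 0
    while offset < length:
        chunk = min(MAX_RDMA_PACKET, length - offset)
        yield src_ea + offset, dst_ea + offset, chunk
        offset += chunk

# A 5 KB write becomes three packets: 2048 + 2048 + 1024 bytes.
print(list(fragment_rdma_write(0x1000, 0x9000, 5 * 1024)))
```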
HFI GSM Small RDMA (Atomic) Update Flow

Source node:
– Check space in the send FIFO
– Calculate the flit count
– Get send FIFO slots
– Build the base header and message header
– Copy data to the send FIFO (cache/memory)
– Ring the HFI doorbell
– HFI dma_reads the packet from the send FIFO and dma_writes it into the ISR network

Destination node:
– HFI performs the atomic read-modify-write in the task’s EA space in memory
– Initiator and/or remote completion notifications are generated to the receive FIFOs, with or without fetch data
– User callback to consume the notification

Single or multiple packets per doorbell
Packets processed in FIFO order
Single cache line packet size
Initiator notification packet traverses the network in the opposite direction from the data (not shown)
HFI GSM Remote Atomic Update Flow

Source node:
– Check space in the send FIFO
– Calculate the flit count
– Get send FIFO slots
– Build the base header and message header
– Copy data to the send FIFO (cache/memory)
– Ring the HFI doorbell
– HFI dma_reads the packet from the send FIFO, dma_writes it into the ISR network, and updates the indicated packets-sent count register

Destination node:
– HFI performs the atomic read-modify-write in the task’s EA space in memory
– HFI updates the indicated packets-received count register

Single or multiple packets per doorbell
Packets processed in FIFO order
Single cache line packet size
Single or multiple remote atomic updates per packet
No notification packets
Assumes hardware reliability
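Because a remote-atomic packet is a single cache line and may carry several independent updates, a sender can batch them. The encoding below (op byte, 8-byte EA, 8-byte operand) is a made-up format used only to illustrate the batching; it is not the HFI wire format.

```python
# Illustrative batching of remote atomic updates into one cache-line packet.
import struct

ATOMIC_OPS = {"ADD": 0, "AND": 1, "OR": 2, "XOR": 3}
UPDATE_FMT = "<BQQ"                 # op (1B), target EA (8B), operand (8B) - made up
CACHE_LINE = 128

def pack_remote_atomics(updates):
    """updates: iterable of (op_name, ea, operand); returns one packet payload."""
    payload = b"".join(struct.pack(UPDATE_FMT, ATOMIC_OPS[op], ea, val)
                       for op, ea, val in updates)
    assert len(payload) <= CACHE_LINE, "a remote-atomic packet is one cache line"
    return payload

pkt = pack_remote_atomics([("ADD", 0x2000, 1), ("XOR", 0x2040, 0xFF)])
print(len(pkt))   # 34 bytes, well within one 128-byte cache line
```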
Integrated Switch Router (ISR) and Routing
Integrated Switch Router (ISR) Features
Two tier, full graph network
3.0 GHz internal 56x56 crossbar switch
– 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports
Virtual channels for deadlock prevention
Input/Output Buffering
2 KB maximum packet size
– 128B FLIT size

Link Reliability
– CRC-based link-level retry
– Lane steering for failed links

IP Multicast Support
– Multicast route tables per ISR for replicating and forwarding multicast packets

Global Counter Support
– ISR compensates for link latencies as counter information is propagated
– HW synchronization with Network Management setup and maintenance
Integrated Switch Router (ISR) - Routing
Packet’s View: Distributed Source Routing
– The paths taken by packets are deterministic (direct routing)
– Packets are injected with the desired destination indicated in the header
– Partial routes are picked up in the ISR at each hop of the path

Routing Characteristics
– 3-hop L-D-L longest direct route
– 5-hop L-D-L-D-L longest indirect route
– Cut-through wormhole routing
– Full hardware routing using distributed route tables across the ISRs
  • Source route tables for packets injected by the HFI
  • Port route tables for packets at each hop in the network
  • Separate tables for inter-supernode and intra-supernode routes
– FLITs of a packet arrive in order; packets of a message can arrive out of order
Integrated Switch Router (ISR) – Routing (cont’d)
Routing Modes
– Hardware Single Direct Routing
– Hardware Multiple Direct Routing
  • For a less-than-full-up system where more than one direct path exists
– Hardware Indirect Routing for data striping and failover
  • Round-robin, random
– Software-controlled indirect routing through hardware route tables

Route Mode Selection
– Dynamic network information is provided to upper layers of the stack to select the route mode
– The decision on the route can be made at any layer and percolated down
– The route mode is placed into the packet header when a packet is injected into the ISR
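The direct/indirect distinction can be captured in a few lines. The sketch below only illustrates the hop structure (L-D-L vs. L-D-L-D-L) and random intermediate selection; the real implementation walks hardware route tables in each ISR, and the function and tuple shapes here are invented for illustration.

```python
# Illustration of direct vs. indirect supernode routes (hop structure only).
import random

def direct_route(src_sn, dst_sn):
    # L-hop to a hub owning a D-link toward dst, the D-hop, then an L-hop to the target hub.
    return [("L", src_sn), ("D", (src_sn, dst_sn)), ("L", dst_sn)]

def indirect_route(src_sn, dst_sn, num_supernodes):
    # Route through a randomly chosen intermediate supernode (# SNs - 2 choices).
    mid = random.choice([s for s in range(num_supernodes) if s not in (src_sn, dst_sn)])
    return [("L", src_sn), ("D", (src_sn, mid)), ("L", mid),
            ("D", (mid, dst_sn)), ("L", dst_sn)]

print(len(direct_route(0, 5)))          # 3 hops: L-D-L
print(len(indirect_route(0, 5, 64)))    # 5 hops: L-D-L-D-L
```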
SuperNode
(Figure: a supernode is four POWER7 IH drawers, each with 8 octants (QCM–hub chip pairs). Each octant pairs a QCM (4-chip POWER7 processor module) with a hub chip containing the Integrated Switch Router (ISR). Each hub has 7 L-local links to the hubs within its drawer, 24 L-remote links to the hubs in the other three drawers of the supernode, and 16 D-links to other supernodes.)
SN-SN Direct Routing

(Figure: a direct route from SuperNode A to SuperNode B, e.g. LR12 → D3 → LR21.)
– L-hops: 2, D-hops: 1 (total: 3 hops)
– Maximum bisection bandwidth
– One to many paths, depending on system size
SN-SN Indirect Routing

(Figure: an indirect route from SuperNode A to SuperNode B through an intermediate SuperNode x, e.g. LR30 → D5 → LL7 → D12 → LR21.)
– L-hops: 3, D-hops: 2 (total: 5 hops)
– Total indirect paths = (# SNs) − 2
Integrated Switch Router (ISR) – Network Management
Network Management software initializes and monitors the interconnect
– Central Network Manager (CNM) runs on the Executive Management Server (EMS)
– Local Network Management Controllers (LNMC) run on the service processors in each drawer
Cluster network configuration and verification tools for use at cluster installation
Initializes the ISR links and sets up the route tables for data transmission
Runtime monitoring of network hardware
– Adjusts data paths to circumvent faults and calls out network problems
Collects state and performance data from the HFI & ISR on the hubs during runtime
Configures and maintains Global Counter master
Collectives Acceleration Unit
Collectives Acceleration Unit (CAU) Features
Operations
– Reduce: NOP, SUM, MIN, MAX, OR, AND, XOR
– Multicast

Operand Sizes and Formats
– Single precision and double precision
– Signed and unsigned
– Fixed point and floating point

Extended Coverage with Software Aid
– Types: barrier, all-reduce
– Reduce ops: MIN_LOC, MAX_LOC, (floating point) PROD

Tree Topology
– Multiple-entry CAM per CAU: supports multiple independent trees
– Multiple neighbors per CAU: each neighbor can be either a local or remote CAU/HFI
– Each tree has one and only one participating HFI window on any involved node
– It is up to software to set up the topology
Collectives Acceleration Unit (CAU) Features (cont’d)
Sequence Numbers for Reliability and Pipelining
– Software-driven retransmission protocol if credit return is delayed
– The previous collective is saved on the CAU for retransmission
  • Allows retransmission from the point of failure vs. restart of the entire operation

Reproducibility
– Binary trees for reproducibility
– Wider trees for better performance
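The reproducibility point is about floating-point reduction order: a fixed binary tree always applies the operation in the same association order, so the rounded result is identical across runs, while combining contributions in arrival order can vary. The sketch below is a generic illustration of that effect, not CAU code.

```python
# Why a fixed binary reduction tree is reproducible: the association order
# (and therefore the rounding) is always the same.
def tree_reduce(values, op):
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return op(tree_reduce(values[:mid], op), tree_reduce(values[mid:], op))

vals = [1e16, 1.0, -1e16, 1.0]
add = lambda a, b: a + b
print(tree_reduce(vals, add))   # fixed tree order: same answer on every run
print(sum(vals))                # a different association order rounds differently
```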
CAU: Example of Trees
(Figure: eight nodes, each with a CAU and an HFI, participating in four software-defined trees.)
– Tree A: nodes 0, 1, 2, 3
– Tree B: nodes 0, 5
– Tree C: nodes 4, 5, 6, 7
– Tree D: nodes 3, 7
CAU: Operations on a Tree

(Figure: on the tree spanning nodes 4, 5, 6, and 7, a broadcast flows from the root down through the CAUs to every node, while a reduce flows from the nodes up through the CAUs to the root.)
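A reduce over one of these software-defined trees can be modeled as each CAU combining its children's values and passing one result toward the root. The neighbor-list representation and function below are illustrative only; on the real hardware the tree is set up through the CAM entries described earlier.

```python
# Model of a reduce over a software-defined CAU tree (illustrative only).
REDUCE_OPS = {"SUM": lambda a, b: a + b, "MIN": min, "MAX": max}

def cau_reduce(children, contributions, node, op="SUM"):
    """children: {node: [child nodes]}; contributions: {node: local value}.
    Returns the reduced value delivered to `node` (call with the root)."""
    value = contributions[node]
    for child in children.get(node, []):
        value = REDUCE_OPS[op](value, cau_reduce(children, contributions, child, op))
    return value

# Tree C from the previous figure (nodes 4-7), arbitrarily rooted at node 4.
tree_c = {4: [5, 6], 6: [7]}
print(cau_reduce(tree_c, {4: 1, 5: 2, 6: 3, 7: 4}, node=4))   # -> 10
```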
Chip and Module Metrics
POWER7 Hub Chip
45 nm lithography, Cu, SOI
– 13 levels of metal
– 440M transistors

582 mm² die
– 26.7 mm x 21.8 mm
– 3707 signal I/O
– 11,328 total I/O

61 mm x 96 mm glass ceramic LGA module
– 56 12X optical modules
  • LGA attach onto substrate
1.128 TB/s interconnect bandwidth
(Die photo with labeled regions: ISR, HFIs, NMMU, CAU, POWER7 coherency bus, POWER7 phys, L-local link phys, LR & D link phys, and PCIe phys.)
Off-chip Interconnect and PLLs

(4) 8B W,X,Y,Z interconnect buses to POWER7
– 3.0 Gb/s single-ended EI-3 → 192 GB/s throughput

(7) 8B L-Local (LL) interconnect buses to hub chips
– 3.0 Gb/s single-ended EI-3 → 336 GB/s throughput
  • Shared physical transport for the cluster interconnect and the POWER7 coherency bus protocol

(24) L-Remote LR[0..23] hub optical interconnect buses (intra-supernode)
– 6b @ 10 Gb/s differential, 8b/10b encoded → 240 GB/s throughput

(16) D[0..15] hub optical interconnect buses (inter-supernode)
– 10b @ 10 Gb/s differential, 8b/10b encoded → 320 GB/s throughput

PCIe general-purpose I/O → 40 GB/s throughput

24 total PLLs
– (3) “Core” PLLs
  • (1) internal logic
  • (1) W,X,Y,Z EI-3 buses
  • (1) LL0–LL6 EI-3 buses
– (2) intermediate-frequency “IF LC Tank” PLLs
  • (1) optical LR buses
  • (1) optical D buses
– (14) high-frequency “HF LC Tank” PLLs
  • (6) optical LR buses
  • (8) optical D buses
– (5) PCI-E “Combo PHY” PLLs

1.128 TB/s total hub I/O bandwidth
Hub Module Overview

(Figure: module assembly showing the POWER7 hub chip, LGA interposer, heat spreader, optical modules & fiber ribbons, and the D-link and LR-link optical module sites.)

Module attributes:
– Technology: high-performance glass ceramic LGA
– Body size: 61 mm x 95.5 mm
– LGA grid: depopulated 58 x 89
– Layer count: 90
– Module BSM I/O: 5139
Summary
Key System Benefits Enabled by the PERCS Interconnect
A PetaScale System with Global Shared Memory
Dramatic improvement in Performance and Sustained Performance
– Scale-out application performance (Sockets, MPI, GSM)
– Ultra-low latency, enormous bandwidth, and very low CPU utilization

Elimination of the Traditional Infrastructure
– HPC network: no HCAs or external switches; 50% fewer PHYs and cables than an equivalent fat-tree structure with the same bisection bandwidth
– Storage: no FCS HBAs, external switches, storage controllers, or DASD within the compute node
– I/O: no external PCI-Express controllers

Dramatic cost reduction
– Reduces the overall Bill of Materials (BOM) cost of the system

A step-function improvement in Data Center reliability
– Compared to commodity clusters with external storage controllers, routers/switches, etc.
Full virtualization of all hardware in the data center
Robust end-to-end systems management
Dramatic reduction in Data Center power compared to commodity clusters
Acknowledgements
Authors
– Baba Arimilli, Ravi Arimilli, Robert Blackmore, Vicente Chung, Scott Clark, Wolfgang Denzel, Ben Drerup, Torsten Hoefler, Jody Joyner, Jerry Lewis, Jian Li, Nan Ni, Ram Rajamony, Aruna Ramanan and Hanhong Xue

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.