The Alpha Roadmap
How it applies to Alpha clusters
Ray Hookway
Compaq Computer Corporation
Littleton, MA
[email protected]@compaq.com
Map Features
• Alpha Processor Roadmap
• Alpha Systems
• Alpha Clusters
Processor Roadmap
References:
• Pete Bannon, “Alpha 21364: A Scalable Single-chip SMP”, Microprocessor Forum 1998, http://www.digital.com/alphaoem/microprocessorforum.htm
• Joel Emer, “Simultaneous Multithreading: Multiplying Alpha Performance”, Microprocessor Forum 1999
Alpha Roadmap
[Figure: roadmap chart for 1998 to 2003 with axes "Higher Performance" and "Lower Cost". Parts shown: EV6 (21264), 0.35 µm; EV67, 0.28 µm; EV68, 0.18 µm; EV7, 0.18 µm; EV78, 0.125 µm; EV8, 0.125 µm.]
Alpha 21264 is performance leader
[Figure: bar chart of SPECfp95 (axis 0 to 60) for Compaq AlphaServer DS20, HP D380, Sun UE250, and IBM F50. Bar values in the order extracted: 58.7, 17.4, 20.6, 12.6.]
Alpha 21264 Systems
AlphaServer 8400 with EV6/575
Benchmark                CPUs  EV6/575  EV56/600  Ratio
SPECint95                   1     30.3      18.4    1.6
SPECfp95                    1     47.7      20.8    2.2
Linpack 100x100             1      460       280    1.5
TPC-C (K/Min)*              8     37.5      24.5    1.5
AIM 7 max users (K)         8     10.5       6.9    1.4
SPECweb (K conn/sec)**      8     12.2       7.8    1.5
* 37,541 tpmC at $79.4/tpmC for 8 CPUs, 16GB, Sybase V11.9; available 12/98
** estimated
IA-64 vs. Alpha Philosophy
EPIC
• Smart compiler and a dumb machine
– Compiler creates record of execution
– Machine plays record
– Stall when compiler is wrong
• Focus on vector programs
– Compiler transforms scalar to vector
– What about: function calls, indirection; dynamic linking; C++, Java/JIT
ALPHA
• Smart compiler, smart machine, and a GREAT circuit design
– Compiler creates record of execution
– Machine exploits additional information available at runtime
– Works across barriers to compile-time analysis
• Focus on scalar programs
– Add resources for vector
– Amdahl’s law
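
The Amdahl's law bullet can be made concrete with the standard formula: if only part of the run time benefits from vector resources, the overall gain is bounded by the part that does not. The sketch below is illustrative only; the fractions and the 10x factor are assumptions, not figures from the talk.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of run time is sped up by `factor`.

    Standard Amdahl's law: 1 / ((1 - f) + f / s).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Hypothetical workloads: share of time spent in vectorizable code.
    for f in (0.2, 0.5, 0.9):
        s = amdahl_speedup(f, 10.0)
        print(f"vector fraction {f:.0%}: 10x vector speedup gives {s:.2f}x overall")
```

With a small vectorizable fraction the scalar remainder dominates, which is the argument for focusing on scalar performance first.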
Alpha 21364 Goals
Improve
• Single processor performance, operating frequency, and memory system
• SMP scaling
• System performance density (computes/ft³)
• Reliability and availability
Decrease
• System cost
• System complexity
“It’s the Memory, Stupid”
Dick Sites
Estimated time for TPC-C
[Figure: stacked bar chart, vertical axis 0 to 100, with time broken down into Issue, Mispred, Trap, Cache, and Memory for three configurations: New core, Higher MHz, Higher integration.]
Alpha 21364 Features
• Alpha 21264 core with enhancements
• Integrated L2 Cache
• Integrated memory controller
• Integrated network interface
• Support for lock-step operation to enable high-availability systems
21364 Chip Block Diagram
[Figure: block diagram showing the 21264 core with 64K Icache and 64K Dcache, 16 L1 miss buffers, 16 L1 victim buffers, L2 cache, 16 L2 victim buffers, memory controller (RAMBUS), and network interface (N, S, E, W, I/O), with Address In and Address Out paths.]
21364 Core
[Figure: core pipeline diagram. Stage labels: FETCH, MAP, QUEUE, REG, EXEC, DCACHE (stages 0 1 2 3 4 5 6). Blocks shown: branch predictors; next-line address; L1 instruction cache (64KB, 2-set); integer and FP register maps; integer issue queue (20); FP issue queue (15); integer register files (80); FP register file (72); Exec and Addr units; FP ADD, Div/Sqrt; FP MUL; L1 data cache (64KB, 2-set); victim buffer; miss address; L2 cache (1.5MB, 6-set). Annotations: 4 instructions / cycle; 80 in-flight instructions plus 32 loads and 32 stores.]
Integrated L2 Cache
• 1.5 MB
• 6-way set associative
• 16 GB/s total read/write bandwidth
• 16 victim buffers for L1 -> L2
• 16 victim buffers for L2 -> Memory
• ECC SECDED code
• 12 ns load-to-use latency
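
The capacity and associativity above fix the number of sets once a line size is chosen. The slide does not state the line size, so the 64-byte value below is an assumption; the function names are hypothetical. A minimal sketch of the set count and of how an address would split into tag, set index, and byte offset:

```python
CACHE_BYTES = 1536 * 1024  # 1.5 MB, from the slide
WAYS = 6                   # 6-way set associative, from the slide
LINE_BYTES = 64            # ASSUMPTION: line size is not given on the slide


def number_of_sets(cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Sets = capacity / (ways * line size); must divide evenly."""
    sets, remainder = divmod(cache_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


def split_address(addr: int, sets: int, line_bytes: int):
    """Return (tag, set index, byte offset) for a physical address."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


if __name__ == "__main__":
    sets = number_of_sets(CACHE_BYTES, WAYS, LINE_BYTES)
    print("sets:", sets)
    print("tag/index/offset:", split_address(0x12345678, sets, LINE_BYTES))
```

Under the 64-byte assumption the odd-looking 6-way organization gives a power-of-two set count, so the index is a plain bit field of the address.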
Integrated Memory Controller
• Direct RAMbus
– High data capacity per pin
– 800 MHz operation
– 30 ns CAS latency pin to pin
• 6 GB/sec read or write bandwidth
• 100s of open pages
• Directory based cache coherence
• ECC SECDED
Integrated Network Interface
• Direct processor-to-processor interconnect
• 10 GB/second per processor
• 15 ns processor-to-processor latency
• Out-of-order network with adaptive routing
• Asynchronous clocking between processors
• 3 GB/second I/O interface per processor
21364 System Block Diagram
[Figure: array of twelve 21364 ("364") processors, each shown with its own memory (M) and I/O (IO).]
Alpha 21364 Technology
• 0.18 µm CMOS
• 1000+ MHz
• 100 Watts @ 1.5 volts
• 3.5 cm²
• 6 Layer Metal
• 100 million transistors
– 8 million logic
– 92 million RAM
Alpha 21364 Performance/Status
• 70 SPECint95 (estimated)
• 140 SPECfp95 (estimated)
• RTL model running
• Tapeout 4Q99
21364 Summary
• The 21364 integrated L2 cache and memory controller provide outstanding single processor performance
• The 21364 integrated network interface enables high performance multi-processor systems
• The high level of integration directly supports systems containing a large number of processors
21464 Overview
• Enhanced out-of-order execution
• 8-wide superscalar
• Large on-chip L2 cache
• Direct RAMBUS interface
• On-chip router for system interconnect
• Glueless, directory-based, ccNUMA for up to 512-way SMP
• 4-way simultaneous multithreading (SMT)
Superscalar Instruction Issue
[Figure: diagram with a Time axis]
Multi-Threading
[Figure: diagram with a Time axis]
Simultaneous Multi-Threading
[Figure: diagram with a Time axis]
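
The three schemes named above differ in which threads may use a cycle's issue slots. The toy model below is a sketch under stated assumptions, not a description of the 21464: each thread is a hypothetical list of "bundles" of mutually independent instructions, a bundle can start only in the cycle after the previous one finishes, and the function name and policies are invented for illustration.

```python
from typing import List


def cycles_to_issue(threads: List[List[int]], width: int, policy: str) -> int:
    """Cycles needed to issue all instructions under a given policy.

    policy: "superscalar" (threads run one after another),
            "mt"  (one thread owns each cycle, round-robin),
            "smt" (all threads share each cycle's issue slots).
    """
    if policy not in ("superscalar", "mt", "smt"):
        raise ValueError(f"unknown policy: {policy}")
    if width < 1:
        raise ValueError("width must be at least 1")
    # Copy the input and drop empty bundles so callers keep their data.
    pending = [[b for b in t if b > 0] for t in threads]
    n = len(pending)
    cycles = 0
    turn = 0  # rotating priority pointer
    while any(pending):
        cycles += 1
        live = [i for i in range(n) if pending[i]]
        by_priority = sorted(live, key=lambda i: (i - turn) % n)
        if policy == "superscalar":
            order = [live[0]]
        elif policy == "mt":
            order = [by_priority[0]]
        else:
            order = by_priority
        slots = width
        for i in order:
            issued = min(slots, pending[i][0])
            pending[i][0] -= issued
            slots -= issued
            if pending[i][0] == 0:
                pending[i].pop(0)  # next bundle is eligible next cycle
        turn = (turn + 1) % n
    return cycles


if __name__ == "__main__":
    work = [[1, 2, 1], [2, 1, 1]]  # two threads with little parallelism each
    for policy in ("superscalar", "mt", "smt"):
        print(policy, cycles_to_issue(work, width=4, policy=policy))
```

In this model, switching threads per cycle does not help when no single thread can fill the machine; only sharing slots within a cycle does.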
What Changed?
• Multiple Program Counters
– Choose among them
• More Architectural Register Space
– Mapper
– Register Files
• Distinguished Per Thread Instruction State
– Register mapping
– Instruction Retire
– Store Buffers
– Abort and Restart Information
What Didn’t Change
• Almost everything else
– No basic functional changes in any stage
– No partitioned instruction cache
– No partitioned data caches
– No partitioned off-chip caches
– No extra register files
– Little special branch prediction mechanism
Multi-threaded Scaling
[Figure: bar chart, axis 0 to 6, comparing "Average IPC of 4 single-threaded programs" with "IPC of 4 programs running multithreaded" for SPEC95 Integer, SPEC95 Integer/FP, SPEC95 Floating Point, and SQL Server. Speedup labels in the order extracted: 1.8x, 2.0x, 1.9x, 2.3x.]
AlphaServer Family Today
AlphaServer DS Series
• Uni and dual processor systems
• Offerings scale to 8GB memory
• Up to 6 PCI slots
AlphaServer ES Series
• 1-4 processors
• Up to 32GB of memory
• Up to 10 PCI slots
AlphaServer GS Series
• 1-64 processors
• Up to 128GB of memory
• Up to 224 PCI slots
Switch-based system - 64-bit PCI I/O subsystems - Very Large Memory
Scalable clusters on DIGITAL UNIX, OpenVMS
Modular system packaging - advanced systems management
AlphaServer DS10
Fast Memory Access
• Large total RAM - 128MB up to 1GB
• High bandwidth access - 1.3 GB/s
Flexible Internal Storage
• Internal dual channel IDE storage included
• Optional SCSI adapter supported
• 3 internal disk bays
Special Features
• 4 PCI I/O slots (3 64-bit, 1 32-bit)
• 300 watt power supply
• 3U small footprint - Rack or Desktop
• Dual embedded 10/100 Ethernet ports
New AlphaServer DS Series
• Solution for project environment
• Fastest uniprocessor design in a 1U form factor
• Fastest memory access with the highest-bandwidth memory in its class
• High speed I/O with 64 bit PCI
• Sleek, compact and powerful package
• Dual Purpose Solutions Support
– Rack and desktop-ready for space constrained environments
AlphaServer DS Series
Fast Memory Access
• Large total RAM - 64MB up to 1GB
• High bandwidth access - 1.3 GB/s
Flexible Internal Storage
• Internal dual channel IDE
• Wide range of PCI options supported
• 2 disk bays - 27 GB IDE / 18 GB SCSI
Special Features
• Optional slimline CD-floppy combo
• Toolless features - snap-out CD and disk
• 1 full PCI I/O slot (64-bit)
• 150 watt power supply
• 1U (1.75”) small footprint – Rack or Desktop
• Dual embedded 10/100 Ethernet ports
Performance and Management Features
• Remote management console
• Serverworks and Compaq Insight Manager
Two ways to build an Alpha cluster
Sierra
• Scalable, robust HPC platform
• Maximum performance over broadest range of applications
• Outstanding system management and reliability features
Beowulf
• Complementary, low-cost, open source model
• Leadership performance over other Linux platforms
• Tru64 UNIX compatibility with common SWD tools
• Support services through Compaq and partners
Sierra Architecture
• Tera-scale systems derived from ASCI PathForward
• Very large Distributed Shared Memory systems
• High speed, scalable interconnect (Quadrics)
• Exploit EV6, EV7 & EV8
• Installed and administered as single system
• System wide scheduler
• High performance file systems (PFS, CFS, AdvFS)
• Application availability
Sierra – ASCI PathForward Project
Alpha Beowulf Clusters
• Compaq ships 64-bit Linux on Alpha systems
• Myrinet and other popular interconnects are supported
• ServerNet-II available in late 1999
• Compaq Tru64 UNIX (Digital UNIX) development tools ported in 1999 (!)
Prepackaged Beowulf Cluster
[Figure: rack layout of an H9A15 cabinet (41U, 71.75'' of rack space, 78.74'' overall). Contents shown: two 1U H7600-AA units (L5-30P input); a 2U (3.50'') Myricom switch; a DECserver 90M; a PORTswitch 900TP/12; a VT420 terminal; a flat panel monitor; a root node DS10; eight 3U (5.25'') DS10 nodes; two 4U (7.00'') StorageWorks BA356 shelves; a 1U CT-D10MJ-SR; open space for an air flow fan. Callouts: "Hub or switch for local network traffic and control."; "Terminal server attached to each console port to provide single console control."; "Myrinet high speed, low latency network."]
Starter DS10-based Beowulf cluster, including eight AlphaServer DS10 compute nodes, one AlphaServer DS10 management station with keyboard/trackball and display, Myrinet™ system area network, 73.1 GB JBOD UltraSCSI disk storage, Ethernet multiplexer for system management, and all Linux software required for basic Beowulf operation.
ServerNet-II Interconnect
• Scalable high-performance network.
• 65,536 end nodes, 5 km range.
• Multi-gigabit, low latency, low CPU, cheap.
• VIA - Virtual Interface Architecture.
• MPI - Message Passing Interface.
• Open source Intel and Alpha Linux drivers.
• NT, Tru64, NonStop Clusters, VxWorks.
Virtual Interface Architecture (VIA)
[Figure: VIA software stack. Layers shown: Applications (DBMS, Apps); OS Vendor API; VI Primitive Library (Open/Close/Map Memory, Send/Receive/Read/Write); VI Kernel Support; VI Kernel HW Interface; SAN Media Interface (ServerNet, Ethernet, ...). Several VIs are drawn with a CQ.]
• 65,536 VIs per node.
• RDMA & send/recv.
• Reliable reception.
• < 2% CPU utilization.
• Low latency/zero copy.
• Thread-safe, protected.
• Basis for COMPAQ’s “System I/O”.
Communication through “Virtual Interfaces (VI)” with associated “Completion Queues (CQ)”.
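
The VI and CQ relationship can be sketched as a small in-memory model: a receiver posts buffers on its VI, a sender posts a send, and both sides learn of the finished work by polling a completion queue. This is a conceptual toy only. The class and method names are hypothetical and are not the VI Primitive Library API, and the "transfer" is an ordinary copy rather than hardware DMA.

```python
from collections import deque


class CompletionQueue:
    """Collects (vi_name, operation, payload) records as work completes."""

    def __init__(self):
        self.entries = deque()

    def poll(self):
        """Return the oldest completion, or None if the queue is empty."""
        return self.entries.popleft() if self.entries else None


class VirtualInterface:
    """One endpoint of a connected pair, bound to a completion queue."""

    def __init__(self, name, cq):
        self.name = name
        self.cq = cq
        self.peer = None
        self.recv_buffers = deque()

    def connect(self, other):
        self.peer, other.peer = other, self

    def post_recv(self, size):
        """Pre-post a receive buffer; sends land directly in it."""
        self.recv_buffers.append(bytearray(size))

    def post_send(self, data):
        if self.peer is None:
            raise RuntimeError("VI is not connected")
        if not self.peer.recv_buffers:
            raise RuntimeError("peer has no posted receive buffer")
        if len(data) > len(self.peer.recv_buffers[0]):
            raise ValueError("message larger than posted buffer")
        buf = self.peer.recv_buffers.popleft()
        buf[: len(data)] = data  # stands in for the hardware transfer
        self.cq.entries.append((self.name, "send", bytes(data)))
        self.peer.cq.entries.append(
            (self.peer.name, "recv", bytes(buf[: len(data)]))
        )


if __name__ == "__main__":
    cq_a, cq_b = CompletionQueue(), CompletionQueue()
    a, b = VirtualInterface("a", cq_a), VirtualInterface("b", cq_b)
    a.connect(b)
    b.post_recv(16)
    a.post_send(b"ping")
    print(cq_a.poll(), cq_b.poll())
```

Requiring a pre-posted buffer is what lets data land in its final location without an intermediate copy, which is the idea behind the "zero copy" bullet.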
ServerNet-II Components
Beowulf.loc1.Tandem.com
ServerNet-II Hardware Components
[Figure: Router II diagram with eight line cards, IBC logic, and two positions labeled "FCAL bridge, dual line card, or LAN bridge".]
• Dual-port PCI interface (NIC)
– VIA in hardware, DCE
– negligible CPU cost
– 64 bit, 33 MHz & 66 MHz
• 12 port crossbar switch
– wormhole routed
– < 300 nsec latency
– “fat pipe” channel bonding
– bridges to fibre channel, gigabit ethernet
• Gigabit ethernet cables
– copper or fibre optic
– 5 meters to 5 km
ServerNet-II Hardware Performance
• 1.25+1.25 gigabit/s links 1999, doubles 2001.
• < 300 nanosecond path formation per stage.
• 1M end nodes, 5 km fibre optic links.
                 Single VI   Multiple VIs
33 MHz PCI-64    166 MB/s    240 MB/s
66 MHz PCI-64    197 MB/s    350 MB/s
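
For context, the throughput figures above can be compared with the theoretical peak of a 64-bit PCI bus. The sketch below assumes one 8-byte transfer per clock, uses the nominal 33 and 66 MHz clocks, and takes 1 MB as 10^6 bytes; real buses lose cycles to arbitration and addressing, so the peak is an upper bound only. The helper names are hypothetical.

```python
BUS_WIDTH_BYTES = 8  # 64-bit PCI

# Throughput figures from the slide, in MB/s, keyed by PCI clock in MHz.
MEASURED_MB_S = {
    33: {"single VI": 166, "multiple VIs": 240},
    66: {"single VI": 197, "multiple VIs": 350},
}


def pci_peak_mb_s(clock_mhz: int, width_bytes: int = BUS_WIDTH_BYTES) -> int:
    """Peak MB/s assuming one full-width transfer on every clock."""
    return clock_mhz * width_bytes


def fraction_of_peak(clock_mhz: int, measured_mb_s: float) -> float:
    return measured_mb_s / pci_peak_mb_s(clock_mhz)


if __name__ == "__main__":
    for clock, rows in MEASURED_MB_S.items():
        peak = pci_peak_mb_s(clock)
        for label, mb_s in rows.items():
            share = fraction_of_peak(clock, mb_s)
            print(f"{clock} MHz, {label}: {mb_s} of {peak} MB/s ({share:.0%})")
```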
Reliability
• Dual port NICs, dual network topologies.
• Link level CRC, in-band control protocol.
• Strong packet ordering guarantees.
• Every packet is acknowledged by receiver.
• Automatic retry on transmission failure.
• Avoids deadlock & livelock.
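
The combination of a link-level CRC, per-packet acknowledgement, and automatic retry can be illustrated with a toy sender. This is a sketch of the general technique, not the ServerNet-II protocol: it uses CRC-32 from Python's `zlib` as a stand-in checksum, and `make_flaky_link` is a hypothetical test link that corrupts the first few frames.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if the CRC matches, else None."""
    if len(framed) < 4:
        return None
    payload, crc = framed[:-4], framed[-4:]
    return payload if zlib.crc32(payload).to_bytes(4, "big") == crc else None


def make_flaky_link(bad_sends: int):
    """Hypothetical link that corrupts the first `bad_sends` frames."""
    state = {"left": bad_sends}

    def link(framed: bytes) -> bytes:
        if state["left"] > 0:
            state["left"] -= 1
            return bytes([framed[0] ^ 0xFF]) + framed[1:]  # flip one byte
        return framed

    return link


def send_reliably(payload: bytes, link, max_tries: int = 5) -> int:
    """Resend until the receiver's CRC check passes; return attempts used."""
    for attempt in range(1, max_tries + 1):
        received = link(frame(payload))
        if check(received) == payload:  # a passing check stands in for the ack
            return attempt
    raise RuntimeError("no acknowledgement after retries")


if __name__ == "__main__":
    print("attempts:", send_reliably(b"hello", make_flaky_link(2)))
```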
Linux Software Development Tools for AlphaServers
• Boosted performance with Compaq Portable Math Library (CPML) for Linux on Alpha
– Significantly increases the precision and speed of mathematical calculations, up to 10 times compared to other mathematical libraries currently available on Linux
– Following the success of the Portable Math Library, now announcing plans for Compaq Extended Math Library to run on Linux AlphaServer systems
• Compaq C compiler announced in April
• Compaq Fortran compilers for Linux announced in April, available in July beta program
• New Compaq C++ compiler
– Makes it easy to support both Linux and Tru64 UNIX software
• New Software Development Test-Drive capabilities
– Test out the performance of your application over the Web
– Get help from our leading Linux developers to optimize your application
SPEC CPU Benchmark*
* not audited
Linpack (100x100) MFlops
[Figure: bar chart, axis 0 to 250 MFlops, for P2/300, P2/450, EV56/500, and EV6/500, with two series: gcc and GEM.]
Summary
• Alpha is the fastest processor available
• Alpha is available in a full range of high performance systems
• Sierra systems provide complete tera-scale solutions
• Compaq wants to be involved in the Beowulf community