Scalable Scientific Computing at Compaq
CAS 2001
Annecy, France
October 29 – November 1, 2001
Dr. Martin Walker
Compaq Computer EMEA
Agenda of the entertainment
From EV4 to EV7: four implementations of the Alpha microprocessor over ten years
Performance on a few applications, including numerical weather forecasting
The Terascale Computing System at the Pittsburgh Supercomputing Center
Marvel: the next (and last) AlphaServer
Grid Computing
Scientific basis for vector processor choice for Earth Simulator project
Comparison of Cray T3D and Cray Y-MP/C90
– J.J. Hack et al., "Computational design of the NCAR community climate model", Parallel Computing 21 (1995) 1545-1569
Fraction of peak performance achieved
– 1-7% on Cray T3D
– 30% on Cray Y-MP/C90
The Cray T3D used the Alpha EV4 processor from 1992
Key ratios that determine sustained application performance (U.S. DoD/DoE)
Alpha EV6 Architecture

[Block diagram: 7-stage pipeline — FETCH, MAP, QUEUE, REG, EXEC, DCACHE (stages 0-6). Fetch: 4 instructions/cycle, branch predictors, next-line address, 64KB 2-set L1 instruction cache. Map: integer and FP register maps. Queue: integer issue queue (20 entries), FP issue queue (15 entries). Reg: integer register files (80 entries), FP register file (72 entries). Exec: four integer/address execution units, FP add/div/sqrt unit, FP multiply unit. Dcache: 64KB 2-set L1 data cache, victim buffer, miss address. 80 in-flight instructions plus 32 loads and 32 stores.]
Weather Forecasting Benchmark
LM = local model, German Weather Service (DWD)
– Current version is RAPS 2.0
– Grid size is 325 × 325 × 35; predefined INPUT set dwd used for all benchmarks
– First forecast hour timed (contains more I/O than subsequent forecast hours)
Machines
– Cray T3E/1200 (EV5/600 MHz) in Jülich, Germany
– AlphaServer SC40 (EV67/667 MHz) in Marlboro, MA
Study performed by Pallas GmbH (www.pallas.com)
Total time (AS SC40 vs. Cray T3E)

[Chart: wall-clock time in seconds (0-900) vs. number of processors (10-60) for the Compaq ES40 and the Cray T3E, each plotted against its ideal-scaling curve]
Performance comparisons
Alpha EV67/667 MHz in the AS SC40 delivers about 3 times the performance of EV5/600 MHz in the Cray T3E on the LM application
– EV5 is running at about 6.7% of peak
– EV67 is running at about 18.5% of peak
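The ~3x figure is consistent with the quoted efficiencies and clock rates. A quick cross-check, assuming (as a simplification not stated on the slide) a peak of 2 flops/cycle on both processors:

```python
# Cross-check of the ~3x claim from the quoted %-of-peak figures.
# Assumes both CPUs peak at 2 flops/cycle (an assumption, not from the slides).
ev5_sustained = 600e6 * 2 * 0.067    # EV5/600 MHz at 6.7% of peak
ev67_sustained = 667e6 * 2 * 0.185   # EV67/667 MHz at 18.5% of peak
print(round(ev67_sustained / ev5_sustained, 2))  # ≈ 3.07
```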
Compilation Times
Cray T3E
– Flags: -O3 -O aggress,unroll2,split1,pipeline2
– Compilation time: 41 min 37 sec
Compaq EV6/500 MHz (EV67 is faster)
– Flags: -fast -O4
– Compilation time: 5 min 15 sec
IBM SP3
– Flags: -O4 -qmaxmem=-1
– Compilation time: 40 min 19 sec
– Note: numeric_utilities.f90 had to be compiled with -O3 in order to avoid crashes
SWEEP3D
3D discrete ordinates (Sn) neutron transport
Implicit wavefront algorithm
– Convergence to stable solution
Target system: multitasked PVP / MPP
– Vector-style code
– High ratio of loads/stores to flops: memory bandwidth- and latency-sensitive; performance is sensitive to grid size
SWEEP3D "as is" Performance

CPU/MHz     CPI    Mflops  % Peak
EV5/613     2.27   113     9.3
EV6/500     1.57   135     13.5
EV7/1100    0.93   497     22.6
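The "% Peak" column follows from the Mflops and clock columns if one assumes a peak of 2 floating-point operations per cycle (`pct_peak` below is a hypothetical helper, not from the slides; the values match the table to within rounding):

```python
# Reproduce the "% Peak" column assuming 2 flops/cycle peak per CPU.
def pct_peak(mflops, mhz, flops_per_cycle=2):
    return 100.0 * mflops / (mhz * flops_per_cycle)

for cpu, mhz, mflops in [("EV5", 613, 113), ("EV6", 500, 135), ("EV7", 1100, 497)]:
    print(cpu, round(pct_peak(mflops, mhz), 1))  # 9.2, 13.5, 22.6
```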
Optimizations to SWEEP3D
Fuse inner loops
– demote temporary vectors to scalars
– reduce load/store count
Separate loops with explicit values for "i2" = -1, 1
– allows prefetch code to be generated
Fixup code moved "outside" loop
– loop unrolling, pipelining
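The first two optimizations can be illustrated with a toy sketch (Python here for brevity; SWEEP3D itself is Fortran, and the arrays and names below are invented, not from the real code):

```python
# Hypothetical illustration of loop fusion plus demoting a temporary
# vector to a scalar. In compiled code the fused form eliminates one
# store and one reload of t per iteration.

def before(a, b):
    """Two loops communicating through a temporary vector t."""
    t = [0.0] * len(a)
    for i in range(len(a)):      # loop 1: store t[i]
        t[i] = a[i] + b[i]
    s = 0.0
    for i in range(len(a)):      # loop 2: reload t[i]
        s += t[i] * a[i]
    return s

def after(a, b):
    """Fused loop; the temporary becomes a scalar held in a register."""
    s = 0.0
    for i in range(len(a)):
        t = a[i] + b[i]          # scalar temporary: no memory traffic
        s += t * a[i]
    return s

a = [float(i) for i in range(8)]
b = [2.0 * i for i in range(8)]
print(before(a, b), after(a, b))  # identical results: 420.0 420.0
```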
Instruction counts/iteration (+ measured cycles on EV6)

                    Original  Optimized
Instructions        144.1     90.4
Loads               39.5      19.6
Stores              14.8      9.5
Cycles/instruction  1.57      0.88
Cycles/iteration    175       75.5
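The measured cycle counts imply a per-iteration speedup of roughly 2.3x:

```python
# Speedup implied by the measured cycles/iteration figures above.
orig_cycles, opt_cycles = 175, 75.5
print(round(orig_cycles / opt_cycles, 2))  # 2.32
```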
Optimized SWEEP3D Performance

CPU/MHz     CPI    Mflops  % Peak
EV6/500     0.88   262     26.2
EV7/1100    0.66   767     34.9
AlphaServer ES45 (EV68/1.001 GHz)

[Block diagram: four Alpha 21264 CPUs, each with its own L2 cache, connected through a crossbar switch (Typhoon chipset) with a quad controller and eight data slices. Labeled bandwidths: 128b CPU ports at 8.0 GB/s each, 64b ports at 4.2 GB/s each, two 256b paths at 4.2 GB/s, and a 64b 266 MB/s link. Memory: SDRAM at 133 MHz, 128MB-32GB, banks 0-3. I/O: 4x AGP (32b @ 133 MHz, 512 MB/s) and PCI buses (64b @ 66 MHz at 512 MB/s and at 256 MB/s).]
Pittsburgh Supercomputing Center (PSC)
Cooperative effort of
– Carnegie Mellon University
– University of Pittsburgh
– Westinghouse Electric
Offices in Mellon Institute
– On CMU campus
– Adjacent to UofP campus
Westinghouse Electric
– Energy Center, Monroeville, PA
– Major computing systems
– High-speed network connections
Terascale Computing System at Pittsburgh Supercomputing Center
Sponsored by the U.S. National Science Foundation
Integrated into the PACI program (Partnerships for Academic Computing Infrastructure)
Serving the "very high end" for academic computational science and engineering
The largest open facility in the world
PSC in collaboration with Compaq and with
– application scientists and engineers
– applied mathematicians
– computer scientists
– facilities staff
Compaq AlphaServer SC technology
System Block Diagram
– 3040 CPUs, Tru64 UNIX, 3 TB memory, 41 TB disk, 152 CPU cabinets, 20 switch cabinets

[Diagram: switch, nodes, servers, disks, control]
ES45 nodes
– 5 per cabinet
– 3 local disks
Row upon row…
Quadrics Switches
– Rail 1 & Rail 0
Middle Aisle, Switches in Center
QSW switch chassis
– Fully wired switch chassis
– 1 of 42
Control nodes and concentrators
The Front Row
Installation: from 0 to 3.465 TFLOPS in 29 days (Latest: 4.059 TFLOPS on 3024 CPUs)
Deliveries & continual integration:
– 44 nodes arrived at PSC on Saturday, 9-1-2001
– 50 nodes arrived on Friday, 9-7-2001
– 30 nodes arrived on Saturday, 9-8-2001
– 50 nodes arrived on Monday, 9-10-2001
– 180 nodes arrived on Wednesday, 9-12-2001
– 130 nodes arrived on Sunday, 9-16-2001
– 180 nodes arrived on Thursday, 9-20-2001
To have shipped 12 September!
Federated switch cabled/operational by 9-23-01
760 nodes clustered by 9-24-01
3.465 TFLOPS Linpack by 9-29-01
4.059 TFLOPS in Dongarra's list dated Mon Oct 22 (67% of peak performance)
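The quoted 67% Linpack efficiency checks out if each EV68/1 GHz CPU is assumed to peak at 2 Gflop/s (2 flops/cycle, an assumption not stated on the slide):

```python
# Linpack efficiency on 3024 CPUs, assuming 2 Gflop/s peak per EV68/1GHz CPU.
peak_tflops = 3024 * 2e9 / 1e12   # 6.048 TFLOPS theoretical peak
print(round(100 * 4.059 / peak_tflops, 1))  # ≈ 67.1
```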
MM5
http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20011023.html
Alpha Microprocessor Summary
EV6
– 0.35 µm, 600 MHz
– 4-wide superscalar
– Out-of-order execution
– High memory BW
EV67
– 0.25 µm, up to 750 MHz
EV68
– 0.18 µm, 1000 MHz
EV7
– 0.18 µm, 1250 MHz
– L2 cache on-chip
– Memory control on-chip
– I/O control on-chip
– cc inter-processor communication on-chip
EV79
– 0.13 µm, ~1600 MHz
EV7 – The System is the Silicon….
• EV68 core with enhancements
• Integrated L2 cache
– 1.75 MB (ECC)
– 20 GB/s cache bandwidth
• Integrated memory controllers
– Direct RAMbus (ECC)
– 12.8 GB/s memory bandwidth
– Optional RAID in memory
• Integrated network interface
– Direct processor-processor interconnects
– 4 links, 25.6 GB/s aggregate bandwidth
– ECC (single error correct, double error detect)
– 3.2 GB/s I/O interface per processor
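The aggregate figure implies 6.4 GB/s per link (an inference from the numbers above, not a separately stated specification):

```python
# Per-link bandwidth implied by the aggregate interconnect figure.
links, aggregate_gb_s = 4, 25.6
print(aggregate_gb_s / links)  # 6.4
```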
The SMP CPU interconnect used to be external logic; with the Alpha EV7, it is on the chip.
EV7 – The System is the Silicon….
The electronics for cache-coherent communication are placed within the EV7 chip.
Alpha EV7 Core

[Block diagram: the EV6 pipeline — FETCH, MAP, QUEUE, REG, EXEC, DCACHE (stages 0-6), 4 instructions/cycle, branch predictors, integer issue queue (20 entries), FP issue queue (15 entries), 80-entry integer and 72-entry FP register files, 80 in-flight instructions plus 32 loads and 32 stores, 64KB 2-set L1 instruction and data caches, victim buffer, FP add/div/sqrt and FP multiply units — with a 1.75MB 7-set L2 cache now on-chip.]
Virtual Page Size
Current virtual page sizes
– 8K
– 64K
– 512K
– 4M
New virtual page sizes (boot-time selection)
– 64K
– 2M
– 64M
– 512M
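Larger pages matter because they multiply TLB reach (reach = number of entries × page size). The sketch below assumes a 128-entry data TLB, which is typical for this processor generation but not stated on the slide:

```python
# TLB reach = entries x page size (the 128-entry TLB is an assumption).
entries = 128
for label, size in [("8K", 8 * 2**10), ("4M", 4 * 2**20), ("512M", 512 * 2**20)]:
    print(label, entries * size // 2**20, "MB of reach")  # 1, 512, 65536 MB
```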
Performance
SPEC95
– SPECint95 75
– SPECfp95 160
SPEC2000
– CINT2000 800
– CFP2000 1200
59% higher than EV68/1GHz
Building Block Approach to System Design
Key components:
• EV7 processor
• IO7 I/O interface
• Dual processor module
Systems grow by adding: processors, memory, I/O
Two complementary views of the Grid

The hierarchy of understanding:
– Data are uninterpreted signals
– Information is data equipped with meaning
– Knowledge is information applied in practice to accomplish a task
– The Internet is about information; the Grid is about knowledge
– Tony Hey, Director, UK eScience Core Program

Main technologies developed by man:
– Writing captures knowledge
– Mathematics enables rigorous understanding, prediction
– Computing enables prediction of complex phenomena
– The Grid enables intentional design of complex systems
– Rick Stevens, ANL
What is the Grid?
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computing capabilities.”
– Ian Foster and Carl Kesselman, editors, “The GRID: Blueprint for a New Computing Infrastructure” (Morgan-Kaufmann Publishers, SF, 1999) 677 pp. ISBN 1-55860-8
The Grid is an infrastructure to enable virtual communities to share distributed resources to pursue common goals
The Grid infrastructure consists of protocols, application programming interfaces, and software development kits to provide authentication, authorization, and resource location and access
– Foster, Kesselman, Tuecke: “The anatomy of the Grid: Enabling Scalable Virtual Organizations” http://www.globus.org/research/papers.html
Compaq and The Grid
Sponsor of the Global Grid Forum (www.globalgridforum.org)
Founding member of the New Productivity Initiative for Distributed Resource Management (www.newproductivity.org)
Industrial member of the GridLab consortium (www.gridlab.org)
– 20 leading European and US institutions
– Infrastructure, applications, testbed
– Cactus "worm" demo at SC2001 (www.cactuscode.org)
Intra-Grid within Compaq firewall
– Nodes in Annecy, Galway, Nashua, Marlboro, Tokyo
– Globus, Cactus, GridLab infrastructure and applications
– iPAQ Pocket PC (www.ipaqlinux.com)
Potential dangers for the Grid
– Solution in search of a problem
– Shell game for cheap (free) computing
– Plethora of unsupported, incompatible, non-standard tools and interfaces
"Big Science"
As with the Internet, scientific computing will be the first to benefit from the Grid. Examples:
– GriPhyN (US Grid Physics Network for Data-intensive Science)
  Elementary particle physics, gravitational wave astronomy, optical astronomy (digital sky survey)
  www.griphyn.org
– DataGrid (led by CERN)
  Analysis of data from scientific exploration
  www.eu-datagrid.org
– There are also compute-intensive applications that can benefit from the Grid
Final Thoughts: all this will not be easy
How good have we been as a community at making parallel computing easy and transparent?
There are still some things we can't do:
– predict the El Niño phenomenon correctly
– plate tectonics and Earth mantle convection
– failure mechanisms in new materials
Validation and verification of numerical simulation are crying needs
Thank You!
Please visit our HPTC Web Site: http://www.compaq.com/hpc
Stability & Continuity for AlphaServer customers
Commitment to continue implementing the Alpha Roadmap according to the current plan-of-record:
– EV68, EV7 & EV79
– Marvel systems
– Tru64 UNIX support
– AlphaServer systems, running Tru64 UNIX, will be sold as long as customers demand, at least several years after EV79 systems arrive in 2004, with support continuing for a minimum of 5 years beyond that
Microprocessor and System Roadmaps

[Roadmap chart, 2001-2005. Alpha processors: EV68 → EV7 → EV79. Itanium Processor Family: Itanium → McKinley → Madison → Itanium Processor Family next generation. AlphaServers: EV68 product family (GS 1-32P, ES 1-4P, DS 1-2P); EV7 family 8-64P (8P BB) and 2-8P (2P BB); EV79 8-64P (8P BB) and 2-8P (2P BB). ProLiant servers: Itanium 1-4P; McKinley family 1-8P and 1-4P; Madison 1-32P; next-generation server family 8-64P, blades, 2P/4P/8P.]
The New HP
– Chairman and CEO: Carly Fiorina
– President: Michael Capellas
– Imaging and Printing ($20B): Vyomesh Joshi
– Access Devices ($29B): Duane Zitzner
– IT Infrastructure ($23B): Peter Blackmore
– Services ($15B): Ann Livermore