Exascale and Beyond
Marc Snir Argonne Na.onal Laboratory & University of Illinois at Urbana-‐Champaign
Supercomputing Today
! Mira -‐ Blue Gene/Q System – 48K nodes / 768K cores – 786 TB of memory – Peak flop rate: 10 PF
! Storage – ~35 PB capacity, 240GB/s bandwidth (GPFS)
October 14
MCS -‐-‐ Marc Snir
2
Supercomputing in the Next 5-8 Years
! Evolu.on toward exascale (x100 performance increase) ! Leverage con.nued evolu.on of CMOS, advances in packaging
(3D stacks), and Non-‐vola.le memory (NVRAM) ! Increased specializa.on of HPC technology – Intel Phi + NIC + stacked memory, GPU + CPU + NIC, Fat ARM + lean ARM + NIC
– Modify and reuse IP serving broader market but build unique chips and unique packages
! Exascale in 2022 seems feasible; – Possibly not for $200M and at 20 Mwags – What happens beyond exascale?
October 14
MCS -‐-‐ Marc Snir
3
What Can We Learn from the Past?
October 14
MCS -‐-‐ Marc Snir
4
The 1990 Big Extinction: The Attack of the Killer Micros (Eugene Brooks, 1990)
Shi$ from bipolar vector machines & to clusters of MOS micros ! Roadblock: bipolar circuits leaked too much current – it became too
hard to cool them (even with liquid nitrogen) ! MOS was leaking very ligle – did not require aggressive cooling ! MOS was used in fast growing markets: controllers, worksta.ons, PCs ! MOS had a 20 year history and clear evolu.on path (“Moore’s Law”) ! MOS was slower
– Cray C90 vs. CM5 in 1991: 244 MHz vs. 32 MHz
October 14
MCS -‐-‐ Marc Snir
5
• Perfect example of “good enough” technology
(Christensen, The Innovator’s Dilemma)
The CMOS Age: Killer Micros, Moore’s Law & Dennard Scaling (1990-2020 (?))
October 14
MCS -‐-‐ Marc Snir
6
Dennard Scaling: • Decrease feature size by a factor of λ and decrease voltage by a factor of λ; then • # transistors increase by λ2 • Clock speed increases by λ • Energy consump.on does
not change (in reality, voltage decrease was
slower; clock speed and energy consump.on increased faster)
What Can We Say About the Future?
October 14
MCS -‐-‐ Marc Snir
7
Stein’s Law: If something cannot go forever, it will stop
! Dennard scaling ended at around 130 nm in 2001-‐2004 – Leakage (sta.c energy) does not scale – it increases as device size shrinks
! Growth in density con.nues (mul.core), but clock speed is (slowly) decreasing
While power consump=on is an urgent challenge, its leakage or sta=c component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago
Interna.onal Technology Roadmap for Semiconductors (ITRS) 2011 ! The ITRS “long term” is the 2017-‐2024 .meframe.
October 14
MCS -‐-‐ Marc Snir
8
Have We Been There?
! History repeats itself: – CMOS technology has hit a power wall, same as ECL in late 80’es • Clock speed is not raising
– Alterna.ve materials are not ready (gallium arsenide and other III-‐V materials; nanowires, nanotubes)
! History does not repeat itself: ✔ There is a much larger industrial base inves.ng in con.nued improvements in current technologies
✗ An alterna.ve “good enough” technology (such as MOS in 1990) IS NOT ready
✗ There is much more code that needs to be rewrigen if a new model is needed (>200MLOCs)
October 14
MCS -‐-‐ Marc Snir
9
The Physical & Engineering Limits
Transistor size cannot shrink forever ! Gate of 0.5 nm (few atoms), < 10 dopant atoms per gate: – Tunnel effects (new devices?); large variance – x nm is the limit (x = 5 nm?)
October 14
MCS -‐-‐ Marc Snir
10
(now) (~2022)
World is 3D ! Current leak through dialectric reduced by using 3D FinFet. But
this increases manufacturing cost, and is a one-‐.me trick.
October 14
MCS -‐-‐ Marc Snir
11
The Physical & Engineering Limits (2)
! Need new materials (III/V, nanotubes…) by end of decade – Expensive to grow different materials with different atom distances (and source of defects)
! Need UV lithography now (currently use 192 nm laser) – Hard to get sufficient light intensity (mirrors, rather than diffrac.on) – Intel invested $4B in ASML; more from TSMC and Samsung
October 14
MCS -‐-‐ Marc Snir
12
Reduction in Feature Size is Less Effective than in the Past
! Not all gates and wires can shrink
! 50%-‐80% of gates are repeaters; their number is increasing
! Addi.onal spacing and larger safety margins are needed to reduce interference, handle manufacturing variances, etc.
! Scagering increases ! Industry “size” is increasingly
meaningless: – 2013 “14 nm” – 2025 “1.8 nm”
x60 increased density? – actually ½ pitch reduced by x4
and density increased by x15 (ITRS 2013)
October 14
MCS -‐-‐ Marc Snir
13
Evolu.on of wire stack
Year 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024
Gate Length 18 16.7 15.2 13.9 12.7 11.6 10.6 9.7 8.8 8.0 7.3
Equivalent Oxide Thickness ● ● ● ● ● ● ● ● ● ● ● Source-‐Drain Leakage* ● ● ● ● ● ● ● ● ● ● ● Threshold Voltage* ● ● ● ● ● ● ● ● ● ● ●
CV/I Intrinsic Delay ● ● ● ● ● ● ● ● ● ● ●
Total Gate Capacitance ● ● ● ● ● ● ● ● ● ● ● Drive Current ● ● ● ● ● ● ● ● ● ● ●
Technological challenges of high-‐performance logic scaling
● technology available ● solu.ons known ● no known solu.ons
• Time line shown for best performing mul.-‐gate transistor technology.
• Based on ITRS reports (2011*, 2012* and 2013 eds.) (courtesy Denis Mamaluy)
The Future Is Not What It Was
14 October 2014
ANL-‐LBNL-‐ORNL-‐PNNL
15
(courtesy J. Aidun)
(ITRS 2013 same as 2011, but stops at 2020)
The Economical Limits May be the Main Barrier ! Cost per transistor has
not decreased last year – Market for increased
performance at increased cost is very small
! Investments in R&D and new fabs keep growing, and cost per chip keeps growing, resul.ng in increased consolida.on
October 14
MCS -‐-‐ Marc Snir
16
• 2009 analysis, assuming R&D of 10-‐12% of revenue and new genera.on every 2 years
• R&D now 16-‐18% and new genera.on comes every 3 years
The Market Constraints
! Leading market for IC is mobile. The drivers in the market have ligle overlap with HPC. ✔ low power ✔ system on chip ✗ small form factor ✗ integra.on of analog and MEMS ✗ limited interest in low error rate ✗ no interest in 64 bit floa.ng point and higher (same for ML/datamining)
October 14
MCS -‐-‐ Marc Snir
18
Current Growth in IC manufacturing (and IT industry as a Whole) Cannot Continue Forever
! Yearly semiconductor shipments ($1000K) – CGA of close to 8% -‐-‐ twice as faster as worldwide GDP growth
October 14
MCS -‐-‐ Marc Snir
19
0
50,000,000
100,000,000
150,000,000
200,000,000
250,000,000
300,000,000
350,000,000
America
Asia-‐Pacific
EUROPE
Japan
Worldwide
What Next?
October 14
MCS -‐-‐ Marc Snir
20
Design Space (ITRS 2013)
October 14
MCS -‐-‐ Marc Snir
21
New Directions
October 14
MCS -‐-‐ Marc Snir
22
Leverage Mobile Technologies
! ARM based systems (Mont Blanc -‐-‐ Bull) ! Low power – massive parallelism ! So�ware resilience (?) ! Leverage ecosystem of SOC vendors (quick integra.on of
components from mul.ple companies)
! Does gap between mobile and HPC negate cost & power advantage of mobile technology?
October 14
MCS -‐-‐ Marc Snir
23
Orthogonal Scaling – Server/Cloud Technology (?)
! Scale above chip level – More func.on per volume – Higher interchip bandwidth – Lower energy cost for interchip communica.on
– 3D stacks – Op.cal interconnects (silicon photonics, photonic waveguides) – Immersive cooling?
– How much will industry invest in technologies for the high-‐end server market?
– To what extent will such technology be available in reusable building blocks?
October 14
MCS -‐-‐ Marc Snir
24
October 14
MCS -‐-‐ Marc Snir
25
Leverage (More) Special-Purpose Architectures
! Specializa.on becomes more affordable if device technology plateaus
October 14
MCS -‐-‐ Marc Snir
26
Algorithm-‐specific system (Anton) HPC-‐specific SOC (Blue Gene)
Spectrum
MD; x50-‐100 faster than conven.onal cluster
Special Purpose Memory Subsystem? No Caches – Only Scratchpads
! Most CPU energy is consumed in memory subsystem – Flops are 2% of energy for Linpack (Kogge)
! Most of it is “wasted” – E.g., 10’s of SRAM accesses in order to bring data from memory
! Can we specialize memory subsystem for HPC?
! General concept: Fric=onless architecture
October 14
MCS -‐-‐ Marc Snir
27
Approximate Computing
! There is a tradeoff between power consump.on and accuracy of computa.on – Reduce precision (32 bit vs. 64 bit) – Use approximate FPU (provides right result for “most” inputs) – Tolerate higher error rate in CPU and in memory
! Algorithmic work: Show savings are not negated by extra effort to tolerate more “errors”
! Device/architecture work: How to use noisy CMOS gates – We know it is doable: see overclocked systems
! Can “noise” become a knob in conven.onal architectures?
October 14
MCS -‐-‐ Marc Snir
28
Weather Simulation – Lorenz 96’ Model
29
Applica.on Quality Impact due to Inexactness
Xn = Large Scale Parameters;1 model .me unit = 5 days
Short-‐term Forecasts
Exact&Floa*ng&Point&Instruc*ons&
Inexact&Floa*ng&Point&Instruc*on&
Memory&Access&Instruc*ons&
Other&Instruc*ons&
Inexact FloaQng Point InstrucQons
(24%)
Memory Access
InstrucQons (35%)
Other InstrucQons (Branch,
integer etc.) (35%)
Exact FloaQng Point InstrucQons
(6%)
GAIN area power delay
adder 70% 66% 26%
mul.plier 92% 92% 19%
On the use of inexact, pruned hardware in atmospheric modelling Düben, Joven, Lingamneni, McNamara, De Micheli, Palem and Palmer
New Device Technology
! Cryogenic Compu.ng – quantum logic (IARPA) – Use Josephson junc.on for logic – Use current in superconduc.ng loop for storage
! Communica=on is free (on chip/off chip); computa=on is (rela=vely) expensive – Will lead to very different architectures (back to the future)
! Much larger feature size (193 nm end of decade) ! Very different ecosystem – No cryogenic cell-‐phones; current market is for small, special-‐purpose devices sold by small companies
October 14
MCS -‐-‐ Marc Snir
30
October 14
MCS -‐-‐ Marc Snir
31
23
NVIDIA Projections for CMOS vs. Superconductor RQL Circuits
z Projected energy consumption for RQL processors:• Communication is Cheap, FLOPS are Expensive
(cmp. to Communication), Instruction Scheduling & Main memory access costs are TBD
• On-chip data transfer takes negligible energy • ~5,000-10,000X LESS on-chip than in 10 nm CMOS
• Off-chip communication has ~ same negligible energy costs as on-chip one @ the same rates
• Floating-point operations take most energy • ~2-3X LESS energy/op than in 10 nm CMOS
MIT LL fabrication process technology
248 nm(~2017)
193 nm (~2019)
JJ critical current density Jc 10 kA/cm2
10 kA/cm2
Min. JJ critical current Ic 38 20
Frequency 8.5 GHz 10 GHz
Energy for a DP FP Multiply-Add 4.2 pJ 2.2 pJEnergy for a 64-bit integer add 0.21 pJ 0.11 pJ
64-bit read from a 64x64-bit register file (dynamic)
0.15 pJ 0.08 pJ
Wire (PTL) energy per non-zero bit (incl. drivers & receivers) 1-20 mm
0.25 fJ/bit
0.13 fJ/bit
Wire (PTL) energy (256 bits, 1-20 mm) (random data)
0.032pJ
0.017pJ
Mikhail Dorojevets, HPC 2014
B. Dally, DOE Exaflops WS, 2011
with the cryocooling efficiency of 0.1% (1000 W/W)
Adiabatic & Reversible Computing
! Possible follow-‐up to cryogenic compu.ng ! Theore.cal limit on switching energy: ln(2) kT ! Current CMOS > 100,000 kT ! Demonstrated energy consump.on close to theore.cal limit for
simple circuits (e.g., shi� register built of nSQUIDS, 5 GHz, 4K)
October 14
MCS -‐-‐ Marc Snir
32
! Same device technology could be used for quantum compu.ng
Something Totally Different
! Analog compu.ng – Op.cal (hybrid) compu.ng [Optalysys] – analog op.cal computa.on (Matrix product, Fourier transform) – “Optalysys thinks its Op=cal Solver could scale up to 17.1 exaflops by 2020”.
! Quantum Compu.ng (and quantum annealing) ! Neuromorphic compu.ng ! Biocompu.ng None seem to apply to scien=fic simula=ons but could apply to data analysis
October 14
MCS -‐-‐ Marc Snir
33
Summary
! Exascale will be there by 2022 or so ! There is no abrupt transi.on to a new technology beyond 2022 –
it just becomes progressively harder to improve performance. – Circuit technology contributes less and less – Architectures and algorithms contribute more and more
! New devices and new designs can be used to con.nue improvements in performance beyond 2025
! But the development of these technologies might not be supported by a mass market and need interna=onal collabora=on
! Top supercomputers become alike large experimental facili.es! – High cost, specialized hardware, long development .me, long use .me
October 14
MCS -‐-‐ Marc Snir
34
October 14
MCS -‐-‐ Marc Snir
35