Technische Universität München

Hybrid Architectures – Why Should I Bother?

CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering

Michael Bader

July 8–19, 2013
The Simulation Pipeline

phenomenon, process, etc.
↓ modelling
mathematical model
↓ numerical treatment
numerical algorithm
↓ parallel implementation
simulation code
↓ visualization
results to interpret

[diagram: validation feeds the results back to the start of the pipeline; an "embedding" arrow leads on to statement/tool]
Parallel Computing – “Faster, Bigger, More”

Why parallel high performance computing?
• Response time: compute a problem in 1/p time (see the sketch after this list)
  – speed up engineering processes
  – real-time simulations (tsunami warning?)
• Problem size: compute a p-times bigger problem
  – simulation of large-/multi-scale phenomena
  – maximal problem size that “fits into the machine”
  – validation of smaller, “operational” models
• Throughput: compute p problems at once
  – case and parameter studies, statistical risk scenarios, etc. (hazard maps, database for tsunami warning, . . . )
  – massively distributed computing (SETI@home, e.g.)
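As a back-of-the-envelope illustration of the 1/p response-time claim (a sketch added here, not part of the original slides): 1/p is the ideal case; with a serial fraction s, Amdahl's law gives the usual upper bound. The function name and the chosen s = 0.01 are hypothetical examples.

  // Illustrative host-side sketch: ideal vs. Amdahl-limited speedup.
  #include <cstdio>

  double amdahl_speedup(double s, int p) {
      // normalized runtime on p processors: s + (1 - s)/p
      return 1.0 / (s + (1.0 - s) / p);
  }

  int main() {
      const int ps[] = {16, 256, 4096};
      for (int p : ps) {
          std::printf("p = %4d: ideal %4dx, Amdahl (s = 0.01) %.1fx\n",
                      p, p, amdahl_speedup(0.01, p));
      }
      return 0;
  }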
Part I

High Performance Computing in CSE –
Past(?) and Present Trends
The Seven Dwarfs of HPC

“dwarfs” = key algorithmic kernels in many scientific computing applications

P. Colella (LBNL), 2004:
1. dense linear algebra
2. sparse linear algebra
3. spectral methods
4. N-body methods
5. structured grids
6. unstructured grids
7. Monte Carlo

Tsunami & storm-surge simulation:
→ usually PDE solvers on structured or unstructured meshes
→ SWE: a simple shallow water solver on Cartesian grids
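To make the “structured grids” dwarf concrete, here is a minimal sketch of the kind of kernel such solvers revolve around: a generic Jacobi-style 5-point-stencil sweep on a Cartesian grid. This is illustrative only and not taken from the SWE code.

  // Hypothetical structured-grid kernel: one stencil sweep over a
  // Cartesian grid of nx * ny cells (boundary layer excluded).
  void sweep(int nx, int ny, const float* u, float* u_new) {
      for (int j = 1; j < ny - 1; ++j)
          for (int i = 1; i < nx - 1; ++i)
              u_new[j * nx + i] = 0.25f * (u[j * nx + i - 1] + u[j * nx + i + 1]
                                         + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
  }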
Computational Science Demands a New Paradigm

Computational simulation must meet three challenges to become a mature partner of theory and experiment (Post & Votta, 2005):
1. performance challenge
   → exponential growth of performance, massively parallel architectures
2. programming challenge
   → new (parallel) programming models
3. prediction challenge
   → careful verification and validation of codes; towards reproducible simulation experiments
Four Horizons for Enhancing the Performance . . .

. . . of Parallel Simulations Based on Partial Differential Equations (David Keyes, 2000)
1. Expanded Number of Processors
   → in 2000: 1,000 cores; in 2010: 200,000 cores
2. More Efficient Use of Faster Processors
   → PDE working sets, cache efficiency
3. More Architecture-Friendly Algorithms
   → improve temporal/spatial locality (see the tiling sketch after this list)
4. Algorithms Delivering More “Science per Flop”
   → adaptivity (in space and time), higher-order methods, fast solvers
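A sketch of what “architecture-friendly” can mean in practice (an assumption added here, reusing the generic stencil sweep from the dwarfs slide): cache blocking. The tile size B would be tuned to the L1/L2 cache of the target machine.

  // Cache-blocked variant of the earlier sweep: the loops are tiled so that a
  // B x B block of the grid stays cache-resident (improves temporal locality).
  void sweep_tiled(int nx, int ny, const float* u, float* u_new, int B) {
      for (int jj = 1; jj < ny - 1; jj += B)
          for (int ii = 1; ii < nx - 1; ii += B)
              for (int j = jj; j < jj + B && j < ny - 1; ++j)
                  for (int i = ii; i < ii + B && i < nx - 1; ++i)
                      u_new[j * nx + i] = 0.25f * (u[j * nx + i - 1] + u[j * nx + i + 1]
                                                 + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
  }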
Performance Development in Supercomputing
(source: www.top500.org)
Top 500 (www.top500.org) – June 2013
Top 500 Spotlights – Tianhe-2 and Titan

Tianhe-2/MilkyWay-2 → Intel Xeon Phi (NUDT)
• 3.1 mio cores(!) – Intel Ivy Bridge and Xeon Phi
• Linpack benchmark: 33.8 PFlop/s
• ≈ 17 MW power(!!)
• Knights Corner / Intel Xeon Phi / Intel MIC as accelerator
• 61 cores, roughly 1.1–1.3 GHz

Titan → Cray XK7, NVIDIA K20x (ORNL)
• 18,688 compute nodes; 300,000 Opteron cores
• 18,688 NVIDIA Tesla K20x GPUs
• Linpack benchmark: 17.6 PFlop/s
• ≈ 8.2 MW power
Top 500 Spotlights – Sequoia and K Computer

Sequoia → IBM BlueGene/Q (LLNL)
• 98,304 compute nodes; 1.6 mio cores
• Linpack benchmark: 17.1 PFlop/s
• ≈ 8 MW power

K Computer → SPARC64 (RIKEN, Kobe)
• 88,128 processors; 705,024 cores
• Linpack benchmark: 10.51 PFlop/s
• ≈ 12 MW power
• SPARC64 VIIIfx 2.0 GHz (8-core CPU)
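A derived metric worth noting, computed here from the slide numbers (not stated on the slides): Linpack performance per watt. Since 1 PFlop/s per MW equals 1 GFlop/s per W, the ratio can be read off directly.

  // Energy efficiency from the figures above: PFlop/s divided by MW.
  #include <cstdio>
  int main() {
      struct { const char* name; double pflops, mw; } sys[] = {
          {"Tianhe-2", 33.8, 17.0}, {"Titan", 17.6, 8.2},
          {"Sequoia", 17.1, 8.0},   {"K Computer", 10.51, 12.0},
      };
      for (const auto& s : sys)
          std::printf("%-10s ~%.1f GFlop/s per W\n", s.name, s.pflops / s.mw);
      return 0;
  }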
Performance Development in Supercomputing
(source: www.top500.org)
International Exascale Software Project Roadmap

Towards an Exa-Flop/s Platform in 2018 (www.exascale.org):
1. technology trends
   → concurrency, reliability, power consumption, . . .
   → blueprint of an exascale system: 10-billion-way concurrency, 100 million to 1 billion cores, 10-to-100-way concurrency per core, hundreds of cores per die, . . .
2. science trends
   → climate, high-energy physics, nuclear physics, fusion energy sciences, materials science and chemistry, . . .
3. X-stack (software stack for exascale)
   → energy, resiliency, heterogeneity, I/O and memory
4. politico-economic trends
   → exascale systems run by government labs, used by CSE scientists
Exascale Roadmap
“Aggressively Designed Strawman Architecture”

Level       What                          Perform.    Power    RAM
FPU         FPU, regs., instr.-memory     1.5 Gflops  30 mW
Core        4 FPUs, L1                    6 Gflops    141 mW
Proc. Chip  742 cores, L2/L3, interconn.  4.5 Tflops  214 W
Node        proc. chip, DRAM              4.5 Tflops  230 W    16 GB
Group       12 proc. chips, routers       54 Tflops   3.5 kW   192 GB
Rack        32 groups                     1.7 Pflops  116 kW   6.1 TB
System      583 racks                     1 Eflops    67.7 MW  3.6 PB

approx. 285,000 cores per rack; 166 mio cores in total

Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems
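The per-rack and total core counts follow directly from the table; a quick arithmetic check (a sketch added here, not from the study):

  // 742 cores/chip x 12 chips/group x 32 groups/rack = 284,928 cores per rack;
  // x 583 racks = 166,113,024 cores, matching ~285,000 and ~166 million above.
  #include <cstdio>
  int main() {
      long per_rack = 742L * 12 * 32;
      std::printf("cores per rack: %ld\n", per_rack);        // 284,928
      std::printf("cores total:    %ld\n", per_rack * 583);  // 166,113,024
      return 0;
  }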
Exascale Roadmap – Should You Bother?

Your department’s compute cluster in 5 years?
• a Petaflop system!
• “one rack of the Exaflop system” → using the same/similar hardware
• extrapolated example machine:
  – peak performance: 1.7 PFlop/s
  – 6 TB RAM, 60 GB cache memory
  – “total concurrency”: 1.1 · 10^6
  – number of cores: 280,000
  – number of chips: 384

Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems
Your Department’s PetaFlop/s Cluster in 5 Years?

Tianhe-1A (Tianjin, China; Top500 #10)
• 14,336 Xeon X5670 CPUs
• 7,168 NVIDIA Tesla M2050 GPUs
• Linpack benchmark: ≈ 2.6 PFlop/s
• ≈ 4 MW power

Stampede (TACC; Top500 #6)
• 102,400 cores (incl. Xeon Phi: MIC/“many integrated cores”)
• Linpack benchmark: ≈ 5 PFlop/s
• Knights Corner / Intel Xeon Phi / Intel MIC as accelerator
• 61 cores, roughly 1.1–1.3 GHz
• wider vector FP units: 64 bytes, i.e., 16 floats or 8 doubles (see the sketch after this list)
• ≈ 4.5 MW power
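What the 64-byte vector units mean for code: one instruction operates on 16 floats (or 8 doubles) at once. A minimal sketch of a loop a compiler can map onto such units; the OpenMP 4.0 simd pragma is one portable way to request this (illustrative, not Stampede-specific code).

  // SAXPY-style loop; with 512-bit vector units, each instruction covers
  // 16 floats (assuming the compiler vectorizes as requested).
  void saxpy(int n, float a, const float* x, float* y) {
      #pragma omp simd
      for (int i = 0; i < n; ++i)
          y[i] += a * x[i];
  }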
Free Lunch is Over(∗)

. . . actually already over for quite some time!

Speedup of your software can only come from parallelism:
• clock speed of CPUs has stalled
• instruction-level parallelism per core has stalled
• number of cores is growing
• size of vector units is growing

(∗) Quote and image taken from: H. Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb’s Journal 30(3), March 2005.
Manycore CPU – Intel MIC Architecture
Intel® MIC Architecture: An Intel Co-Processor Architecture

[Block diagram: many vector IA cores, each with a coherent cache, connected via an interprocessor network to fixed-function logic and to memory and I/O interfaces]

• many cores and many, many more threads
• standard IA programming and memory model

(source: Intel/K. Skaugen – SC’10 keynote presentation)
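“Standard IA programming model” means ordinary threaded code carries over to the coprocessor. A hedged sketch, assuming Intel’s compiler offload pragmas of that era; on other compilers the offload pragma is ignored and the loop simply runs on the host:

  // Offload an OpenMP loop to the MIC coprocessor; plain OpenMP otherwise.
  void scale(int n, float a, float* x) {
      #pragma offload target(mic) inout(x : length(n))
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
          x[i] *= a;
  }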
Manycore CPU – Intel MIC Architecture (2)

Diagram of a Knights Corner core (Figure 4 below), with the accompanying text from the same overview paper:
The cores are in-order dual issue x86 processor cores which trace some history to the original Pentium design, but with the addition of 64-bit support, four hardware threads per core, power management, ring interconnect support, 512-bit SIMD capabilities and other enhancements these are hardly the Pentium cores of 20 years ago. The x86-specific logic (excluding L2 caches) makes up less than 2% of the die area for an Intel Xeon Phi coprocessor.

Here are key facts about the first Intel Xeon Phi coprocessor product:
• A coprocessor (requires at least one processor in the system), in production in 2012.
• Runs Linux* (source code available at http://intel.com/software/mic).
• Manufactured using Intel’s 22 nm process technology with 3-D Trigate transistors.
• Supported by standard tools including Intel Cluster Studio XE 2013; a list of additional tools can be found online (http://intel.com/software/mic).
• Many cores:
  – More than 50 cores (it will vary within a generation of products, and between generations; it is good advice to avoid hard-coding applications to a particular number).
  – In-order cores support 64-bit x86 instructions with uniquely wide SIMD capabilities.
  – Four hardware threads on each core (resulting in more than 200 hardware threads available on a single device) are primarily used to hide latencies implicit in an in-order microarchitecture. In practice, use of at least two threads per core is nearly always beneficial. [. . . ]
Figure 4: Knights Corner Core (source: An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors)
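Following the excerpt’s advice of at least two threads per core, a host-side OpenMP sketch (an assumption added here, with hypothetical numbers: 61 cores, 2 threads each):

  // Request 122 OpenMP threads, i.e. two of the four hardware threads on each
  // of 61 Knights Corner cores. The excerpt advises against hard-coding core
  // counts, so a real code would query the count at run time.
  #include <omp.h>
  void configure_threads(int cores /* e.g. 61 */) {
      omp_set_num_threads(2 * cores);
  }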
GPGPU – NVIDIA Fermi
Hardware Execution

CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.
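The grid/block/thread hierarchy described above maps one-to-one onto CUDA source code; a minimal sketch (an illustrative kernel, not taken from the whitepaper):

  // Each thread handles one element; the uniform branch avoids warp divergence,
  // and neighboring threads access neighboring addresses (coalesced access).
  __global__ void axpy(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          y[i] += a * x[i];
  }

  // host side: launch a kernel grid of thread blocks; blocks are distributed
  // to SMs, which execute them in warps of 32 threads:
  //   axpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);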
An Overview of the Fermi Architecture

The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.

Fermi’s 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
(source: NVIDIA – Fermi Whitepaper)
GPGPU – NVIDIA Fermi (2)
Third Generation Streaming Multiprocessor

The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
512 high performance CUDA cores: Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.
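As an aside added here (not whitepaper text): the single-rounding behavior of FMA is visible directly in C/CUDA source, where fmaf is the standard single-precision fused multiply-add.

  #include <cmath>
  // MAD-style: a*b rounded to float, then the sum rounded again (two roundings).
  float mad(float a, float b, float c) { return a * b + c; }
  // FMA: a*b kept exact internally, one rounding at the end; on Fermi this maps
  // to a single instruction for both float and double.
  float fma1(float a, float b, float c) { return fmaf(a, b, c); }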
16 load/store units: Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.
[Figure: Fermi Streaming Multiprocessor (SM) – instruction cache; two warp schedulers, each with a dispatch unit; register file (32,768 × 32-bit); 32 CUDA cores, each with dispatch port, operand collector, FP unit, INT unit, and result queue; 16 LD/ST units; 4 SFUs; interconnect network; 64 KB shared memory / L1 cache; uniform cache]

(source: NVIDIA – Fermi Whitepaper)
GPGPU – NVIDIA Fermi (3)
General Purpose Graphics Processing Unit:
• 512 CUDA cores
• improved double precision performance
• shared vs. global memory
• new: L1 and L2 cache (768 KB)
• trend from GPU towards CPU?
Memory Subsystem Innovations

NVIDIA Parallel DataCache™ with configurable L1 and unified L2 cache: Working with hundreds of GPU computing applications from various industries, we learned that while shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of program behavior.

Adding a true cache hierarchy for load/store operations presented significant challenges. Traditional GPU architectures support a read-only “load” path for texture operations and a write-only “export” path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read-after-write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write/“export” path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data.

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and a unified L2 cache that services all operations (load, store and texture). The per-SM L1 cache is configurable to support both shared memory and caching of local and global memory operations. The 64 KB memory can be configured as either 48 KB of shared memory with 16 KB of L1 cache, or 16 KB of shared memory with 48 KB of L1 cache. When configured with 48 KB of shared memory, programs that make extensive use of shared memory (such as electrodynamic simulations) can perform up to three times faster. For programs whose memory accesses are not known beforehand, the 48 KB L1 cache configuration offers greatly improved performance over direct access to DRAM.

(source: NVIDIA – Fermi Whitepaper)
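The 48/16 vs. 16/48 KB split described above is chosen per kernel from the host; a sketch using the CUDA runtime call for this (myKernel is a hypothetical kernel name):

  #include <cuda_runtime.h>
  __global__ void myKernel() { /* ... */ }

  void choose_split() {
      // shared-memory-heavy kernels: 48 KB shared memory / 16 KB L1
      cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
      // unpredictable access patterns: 16 KB shared memory / 48 KB L1
      // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
  }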
Parallel Computing Paradigms

Not exactly sure what the hardware will look like . . . (CPU-style, GPU-style, something new?)

However: massively parallel programming required
• revival of vector computing
  → several/many FPUs performing the same operation
• hybrid/heterogeneous architectures
  → different kinds of cores; dedicated accelerator hardware
• different access to memory
  → cache and cache coherency
  → small amount of memory per core
• our concern in this course: data parallelism
  → vectorisation (and a look into GPU computing)