MANY-CORE COMPUTING
Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center
6-Oct-2014
Schedule 2
1. Introduction and Programming Basics (2-10-2014)
2. Performance analysis (6-10-2014)
3. Advanced CUDA programming (9-10-2014)
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort (??)
GPUs @ AMD 3
Radeon R9, top of the line: R9 295X2
For comparison:
R9 290X: performance 5.6 TFLOPS, memory 4GB, bandwidth 320 GB/s
NVIDIA GTX980 (Maxwell): performance 5.0 TFLOPS, memory 4GB, lower bandwidth: 224 GB/s
NVIDIA GTX Titan Black (Kepler): performance 5.3 TFLOPS, memory 6GB, higher bandwidth: 336 GB/s
NVIDIA GTX Titan Z vs. R9 295X2: fairly similar numbers, higher DP performance
Today 4
Revisit the VectorAdd
For GPUs
For many-core CPUs
Hardware revisited
Performance analysis
Hardware performance
Application performance
VectorAdd revisited 5
Vector add: sequential 6
void vector_add(int size, float* a, float* b, float* c) {
for(int i=0; i<size; i++) {
c[i] = a[i] + b[i];
}
}
Vector add: GPU code (skeleton) 7
// device code: compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(int N, float* A, float* B, float* C) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < N) C[i] = A[i] + B[i];
}

// host code (should be in the same file)
int main() {
  // initialization code here ...
  int N = 5120;
  // launch N/256 blocks of 256 threads each
  vector_add<<< N/256, 256 >>>(N, deviceA, deviceB, deviceC);
  // cleanup code here ...
}
Multi-core CPU programming 8
Two levels of parallelism:
Coarse-grain: threads / processes
Fine-grain: SIMD operations
Instantiate the threads:
Pthreads
Java threads
OpenMP
MPI
Vectorize:
Rely on compilers
Manual vectorization
Vector types
Intrinsics
OpenMP 9
Add directives to sequential code for parallel sections.

// function to add two vectors
void vector_add(int n, int* a, int* b, int* c) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

// main program
int main() {
  int in1[SIZE], in2[SIZE], res[SIZE];
  vector_add(SIZE, in1, in2, res);
}
OpenMP (for Xeon Phi, too) 10
Add directives to sequential code for parallel sections.

// Phi function to add two vectors
__attribute__((target(mic)))   // for Xeon Phi
void vector_add_Phi(int n, int* a, int* b, int* c) {
  #pragma omp parallel for
  for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

// main program
int main() {
  int in1[SIZE], in2[SIZE], res[SIZE];
  #pragma offload target(mic) in(in1,in2) inout(res)   // for Xeon Phi
  {
    vector_add_Phi(SIZE, in1, in2, res);
  }
}
Cilk (for Xeon Phi, too) 11
Add keywords to parallelize sequential code by divide-and-conquer.

cilk void VectorAdd(float *a, float *b, float *c, int n) {
  if (n < GrainSize) {
    for (int i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
  }
  else {
    spawn VectorAdd(a, b, c, n/2);
    spawn VectorAdd(a + n/2, b + n/2, c + n/2, n/2);
    sync;
  }
}
Vectorization on x86 architectures 12

Since  Name                                            Bits     SP vector size  DP vector size
1996   MultiMedia eXtensions (MMX)                     64 bit   integer only    integer only
1999   Streaming SIMD Extensions (SSE)                 128 bit  4 float         2 double
2011   Advanced Vector Extensions (AVX)                256 bit  8 float         4 double
2012   Intel Xeon Phi accelerator (was Larrabee, MIC)  512 bit  16 float        8 double
Vectorizing with SSE 13
Assembly instructions:
Execute on vector registers
C or C++: intrinsics:
Declare vector variables
Each intrinsic names one instruction
Work on variables, not registers
Vectorizing with SSE examples
float data[1024];
// init: data[0] = 0.0, data[1] = 1.0, data[2] = 2.0, etc.
init(data);
// Set all elements in my vector to zero.
__m128 myVector0 = _mm_setzero_ps();
// Load the first 4 elements of the array into my vector.
__m128 myVector1 = _mm_load_ps(data);
// Load the second 4 elements of the array into my vector.
__m128 myVector2 = _mm_load_ps(data+4);
[Figure: vector contents after each step - myVector0 = (0.0, 0.0, 0.0, 0.0), myVector1 = (0.0, 1.0, 2.0, 3.0), myVector2 = (4.0, 5.0, 6.0, 7.0)]
14
Vectorizing with SSE examples
// Add vectors 1 and 2; instruction performs 4 FLOP.
__m128 myVector3 = _mm_add_ps(myVector1, myVector2);
// Multiply vectors 1 and 2; instruction performs 4 FLOP.
__m128 myVector4 = _mm_mul_ps(myVector1, myVector2);
// _mm_shuffle_ps: the two low result elements come from vec1, the two
// high ones from vec2; _MM_SHUFFLE(w,x,y,z) gives the element indices
// (z, y from vec1; x, w from vec2).
__m128 myVector5 = _mm_shuffle_ps(myVector1, myVector2,
                                  _MM_SHUFFLE(2, 3, 0, 1));
[Figure: with myVector1 = (0.0, 1.0, 2.0, 3.0) and myVector2 = (4.0, 5.0, 6.0, 7.0): myVector3 = (4.0, 6.0, 8.0, 10.0), myVector4 = (0.0, 5.0, 12.0, 21.0), and the shuffled myVector5]
15
Vector add with SSE: unroll loop 16
void vectorAdd(int size, float* a, float* b, float* c) {
  // assumes size is a multiple of 4
  for (int i = 0; i < size; i += 4) {
    c[i+0] = a[i+0] + b[i+0];
    c[i+1] = a[i+1] + b[i+1];
    c[i+2] = a[i+2] + b[i+2];
    c[i+3] = a[i+3] + b[i+3];
  }
}
Vector add with SSE: vectorize loop 17
void vectorAdd(int size, float* a, float* b, float* c) {
  // assumes size is a multiple of 4 and 16-byte-aligned arrays
  for (int i = 0; i < size; i += 4) {
    __m128 vecA = _mm_load_ps(a + i);     // load 4 elts from a
    __m128 vecB = _mm_load_ps(b + i);     // load 4 elts from b
    __m128 vecC = _mm_add_ps(vecA, vecB); // add 4 elts
    _mm_store_ps(c + i, vecC);            // store 4 elts
  }
}
Optional assignment 18
Implement a vectorized version of:
Element-wise array multiplication, with complex numbers
Element-wise array division, with complex numbers
Compile with gcc and measure performance with/without vectorization.
Send (pseudo-)code (and performance numbers, if you have them) by email to
Hardware revisited 19
CPUs
NVIDIA GPUs
Generic multi-core CPU 20
Hardware threads
SIMD units (vector lanes)
L1 and L2 dedicated caches
Shared L3/L4 cache
Main memory, I/O
Peak
performance
Bandwidth
Generic GPU 21
Single or SIMD execution units
Hardware scheduler
Local memory/cache
Units for executing functions with high precision
Peak
performance
Bandwidth
NVIDIA GPUs 22
Kepler: Larger SM (SMX), more registers, better scheduler, dynamic parallelism, multi-GPU
Maxwell: Modular SM (SMM), dedicated registers, dedicated schedulers, more L2 cache
Platform architecture (Fermi) 23
Memory architecture (from Fermi) 24
Configurable L1 cache per SM
16KB L1 cache / 48KB Shared
48KB L1 cache / 16KB Shared
Shared L2 cache
Device memory
[Figure: per-SM registers and L1 cache/shared memory, a shared L2 cache, and device memory, connected to host memory over the PCI-e bus]
Fermi 25
[Figure: Fermi die diagram - GPCs, each with a Raster Engine and four SMs (one Polymorph Engine per SM), shared L2 cache, six memory controllers, Host Interface, GigaThread Engine]
Consumer: GTX 480, 580
HPC: Tesla C2050
More memory, ECC
1.0 TFLOPS SP
515 GFLOPS DP
Streaming multiprocessors (SMs): GTX 580: 16, GTX 480: 15, C2050: 14
768 KB L2 cache
Fermi: SM 26
[Figure: Fermi SM block diagram, a zoom of the die diagram above]
32 cores per SM
64KB configurable
L1 cache / shared memory
32,768 32-bit registers
Fermi: CUDA Core* 27
Decoupled floating-point and integer data paths
Double-precision fused multiply-add (FMA)
Integer operations optimized for extended precision
DP throughput is 50% of SP throughput
DP: 256 FMA ops/clock
SP: 512 FMA ops/clock
*http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
Kepler: the new SMX
Consumer:
GTX680, GTX780, GTX-Titan
HPC
Tesla K10..K40
SMX features
192 CUDA cores
32 in Fermi
32 Special Function Units (SFU)
4 for Fermi
32 Load/Store units (LD/ST)
16 for Fermi
3x Perf/Watt improvement
28
A comparison 29
Maxwell: the newest SMM
Consumer:
GTX 970, GTX 980, …
HPC:
?
SMM Features:
4 subblocks of 32 cores
Dedicated L1/LM per 64 cores
Dispatch/decode/registers per 32 cores
L2 cache: 2MB (~3x vs. Kepler)
40 texture units
Lower power consumption
30
Hardware performance 31
Hardware performance metrics 32
Clock frequency [GHz] = absolute hardware speed
Memories, CPUs, interconnects
Operational speed [GFLOPS]
Instructions per cycle * frequency
Memory bandwidth [GB/s]
Differs a lot between the different memories on the chip
Power [Watt]
Derived metrics: FLOP/Byte, FLOP/Watt
Theoretical peak performance 33
Peak = chips * cores * threads/core * vector_lanes * FLOPs/cycle * clockFrequency
Examples from DAS-4:
Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPS
NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPS
ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPS
DRAM memory bandwidth 34
Throughput = memory bus frequency * bits per cycle * bus width
Memory clock != CPU clock!
The result is in bits/s; divide by 8 for GB/s
Examples:
Intel Core i7 DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s
NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s
ATI HD 6970 GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
Memory bandwidths 35
On-chip memory can be orders of magnitude faster
Registers, shared memory, caches, ...
E.g., AMD HD 7970 L1 cache achieves 2 TB/s
Off-chip memories: depends on the interconnect
Intel's technology: QPI (QuickPath Interconnect), 25.6 GB/s
AMD's technology: HT3 (HyperTransport 3), 19.2 GB/s
Accelerators: PCI-e 2.0, 8 GB/s
Power 36
Chip manufacturers specify the Thermal Design Power (TDP)
We can measure dissipated power
Whole system
Typically (much) lower than TDP
Power efficiency: FLOPS / Watt
Examples (with theoretical peak and TDP):
Intel Core i7: 154 / 160 = 1.0 GFLOPS/W
NVIDIA GTX 580: 1581 / 244 = 6.3 GFLOPS/W
ATI HD 6970: 2703 / 250 = 10.8 GFLOPS/W
Summary

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)
Sun Niagara 2            8            64    11.2       76
IBM BG/P                 4             8    13.6       13.6
IBM Power 7              8            32     265       68
Intel Core i7            4            16      85       25.6
AMD Barcelona            4             8      37       21.4
AMD Istanbul             6             6    62.4       25.6
AMD Magny-Cours         12            12     125       25.6
Cell/B.E.                8             8     205       25.6
NVIDIA GTX 580          16           512    1581      192
NVIDIA GTX 680           8          1536    3090      192
AMD HD 6970            384          1536    2703      176
AMD HD 7970             32          2048    3789      264
Intel Xeon Phi 7120     61           240    2417      352
Absolute hardware performance 38
Only achieved in optimal conditions:
Processing units 100% used
All parallelism 100% exploited
All data transfers at maximum bandwidth
In real life, no application behaves like this
Can we reason about "real" performance?
Optional assignment 39
Compute and fill in the numbers in the table with the CPU and GPU from your machine.
Compute the FLOPs/BW as well.
Compute the numbers and fill in the table for your dream GPU.
Please send me your answers (just the added lines) by Thursday @ 11:00 at
Performance analysis 40
Amdahl's Law
Operational Intensity and the Roofline model
Software performance metrics (3 P's) 41
Performance:
Execution time
Speed-up
Computational throughput (GFLOP/s)
Computational efficiency (i.e., utilization)
Bandwidth (GB/s)
Memory efficiency (i.e., utilization)
Productivity and Portability:
Programmability
Production costs
Maintenance costs
Reason early about performance 42
Amdahl's law: speedup(p) = 1 / (s + (1 - s) / p)
s = fraction of sequential code
p = number of processors
Parallel part: assumed perfectly parallel!
How fast can it really be? Compute the achievable performance.
Amdahl's Law in pictures
RGB to gray
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
Pixel pixel = RGB[y][x];
gray[y][x] =
0.30 * pixel.R
+ 0.59 * pixel.G
+ 0.11 * pixel.B;
}
}
45
Performance evaluation 46
Measure execution time: Tpar
Absolute performance
Calculate speed-up: S = Tseq / Tpar
Relative performance
Does not take the application into account!
Execution time and speedup can be used to compare implementations of the same algorithm.
Performance measurement setup
Image sizes:
Select at least 7 different images
Order them by increasing size
Run the code 10 times per image
Assume outliers are eliminated
Ts = average of 10 sequential runs
Choose different p's:
Tp = average of 10 parallel runs
Tp_par = execution time for the parallel part
Tp_seq = execution time for the sequential part (should be the same for every p)
Report execution times & speed-ups
Full application
Parallel section only
An example: execution time
[Figure: bar chart of execution times Ts, T2, T4, T8, T16 (range 0-35) for Images 1-7]
Same example: speed-up
[Figure: speed-up (range 0-8) for Images 1-7 at p = 2, 4, 8, 16]
Strong scaling
Strong scaling: keep the total workload constant and increase the number of cores/nodes.
Weak scaling: keep the same work per compute node and increase the number of compute nodes.
How would you build a weak scaling experiment?
Derived metrics 50
Throughput: GFLOPS = #FLOPs / Tpar
Takes the application into account!
Compute utilization: Ec = GFLOPS / peak * 100
Bandwidth: BW = #bytes(RD+WR) / Tpar
Takes the application into account!
Bandwidth utilization: Ebw = BW / peak * 100
Achieved bandwidth and throughput can be used to compare *different* algorithms.
Utilization can be used to compare *different* (application, platform) combinations.
Performance analysis 51
Real-life performance vs. theoretical limits:
Understand the bottlenecks
Perform the right optimizations
... and decide when to stop fiddling with the code!
Computing realistic limits is the most difficult challenge in parallel performance analysis:
Theoretical peak limits alone => low accuracy
Use the application characteristics
Use the platform characteristics
Arithmetic/operational intensity 52
The number of operations per byte of accessed memory
Compute-intensive?
Data-intensive?
It is an application characteristic!
Ignore "overheads":
Loop counters
Array index calculations
Branches
RGB to gray 53
for (int y = 0; y < height; y++) {
  for (int x = 0; x < width; x++) {
    Pixel pixel = RGB[y][x]; // 3-byte structure
    gray[y][x] =
        0.30 * pixel.R
      + 0.59 * pixel.G
      + 0.11 * pixel.B;
  }
}

2 x ADD, 3 x MUL = 5 ops per pixel
1 x RD (3 bytes) + 1 x WR (1 byte) => 4 bytes of memory accessed
OI = 5/4 = 1.25
Many-core platforms

Platform             Cores  Threads/ALUs  GFLOPS  Bandwidth  FLOPs/Byte
Sun Niagara 2            8            64    11.2       76        0.1
IBM BG/P                 4             8    13.6       13.6      1.0
IBM Power 7              8            32     265       68        3.9
Intel Core i7            4            16      85       25.6      3.3
AMD Barcelona            4             8      37       21.4      1.7
AMD Istanbul             6             6    62.4       25.6      2.4
AMD Magny-Cours         12            12     125       25.6      4.9
Cell/B.E.                8             8     205       25.6      8.0
NVIDIA GTX 580          16           512    1581      192        8.2
NVIDIA GTX 680           8          1536    3090      192       16.1
AMD HD 6970            384          1536    2703      176       15.4
AMD HD 7970             32          2048    3789      264       14.4
Intel Xeon Phi 7120     61           240    2417      352        6.9
Compute or memory intensive? RGB to Gray 55
[Figure: bar chart of platform FLOPs/Byte ratios (scale 0-17) for the platforms above, plus Intel Xeon Phi 3120, compared against the RGB-to-Gray OI]
“A multi-/many-core processor is a
device built to turn a compute-intensive
application into a memory-intensive
one”
Kathy Yelick, UC Berkeley
Applications OI 56
[Figure: applications placed on the operational intensity axis - O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods; O(log(N)): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
Attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity)
Peak performance is reached iff OI_app >= PeakFLOPs / PeakBW
Compute-intensive iff OI_app >= (FLOPs/Byte)_platform
Memory-intensive iff OI_app < (FLOPs/Byte)_platform
Attainable performance 58
[Figure: roofline plot of attainable GFLOP/s = min(peak floating-point performance, peak memory bandwidth * operational intensity), with the memory-intensive region left of the ridge and the compute-intensive region to the right]
Example: RGB-to-Gray, OI = 1.25
NVIDIA GTX680: P = min(3090, 1.25 * 192) = 240 GFLOPS (only 7.8% of the peak)
Intel Xeon Phi: P = min(2417, 1.25 * 352) = 440 GFLOPS (only 18.2% of the peak)
Attainable performance 59
[Figure: roofline plot with the compute-intensive and memory-intensive regions]
The Roofline model 60
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: comparing architectures 61
AMD Opteron X2: 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
AMD Opteron X4: 73.6 GFLOPS, 15 GB/s, ops/byte = 4.9
Roofline: computational ceilings 62
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: bandwidth ceilings 63
AMD Opteron X2 (two cores): 17.6 GFLOPS, 15 GB/s, ops/byte = 1.17
Roofline: optimization regions 64
Use the Roofline model to determine what to do first to gain performance:
Increase the memory streaming rate
Apply in-core optimizations
Increase the arithmetic intensity
Reader 65
Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures", Communications of the ACM 52(4), 2009.