COSC 6385
Computer Architecture
- Data Level Parallelism (II)
Edgar Gabriel
Fall 2014
SIMD Instructions
• Originally developed for Multimedia applications
• Same operation executed for multiple data items
• Uses fixed-length registers and partitions the carry chain so
that the same functional unit can perform multiple operations
– E.g. a 64-bit adder can be used for two 32-bit add
operations simultaneously (see the intrinsics sketch below)
• Instructions originally not intended to be used by compiler,
but just for handcrafting specific operations in device drivers
• All elements in a register have to be on the same memory
page to avoid page faults within the instruction
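As a concrete illustration (a sketch using standard SSE2 compiler
intrinsics; the wrapper function itself is hypothetical):

#include <emmintrin.h>  // SSE2 intrinsics

// add four 32-bit integers at once inside one 128-bit register
void add4(const int *a, const int *b, int *c)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a); // load a[0..3]
    __m128i vb = _mm_loadu_si128((const __m128i *)b); // load b[0..3]
    __m128i vc = _mm_add_epi32(va, vb);  // four adds, one instruction
    _mm_storeu_si128((__m128i *)c, vc);  // store c[0..3]
}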
SIMD Instructions
• MMX (Multi-Media Extension) – 1996
– The existing 64-bit floating-point registers could be used
for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
– Successor to MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight
16-bit, or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 – 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) - 2010
– 256-bit registers added
AVX Instructions

AVX Instruction  Description
VADDPD           Add four packed double-precision operands
VSUBPD           Subtract four packed double-precision operands
VMULPD           Multiply four packed double-precision operands
VDIVPD           Divide four packed double-precision operands
VFMADDPD         Multiply and add four packed double-precision operands
VFMSUBPD         Multiply and subtract four packed double-precision operands
VCMPxx           Compare four packed double-precision operands for EQ, NEQ, LT, LE, GT, GE, …
VMOVAPD          Move four aligned packed double-precision operands
VBROADCASTSD     Broadcast one double-precision operand to four locations in a 256-bit register
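The packed instructions in the table map directly to compiler
intrinsics; a minimal sketch of VADDPD (the wrapper function is
hypothetical, _mm256_add_pd is the standard intrinsic):

#include <immintrin.h>  // AVX intrinsics

// add four packed double-precision operands: c[0..3] = a[0..3] + b[0..3]
void add4d(const double *a, const double *b, double *c)
{
    __m256d va = _mm256_loadu_pd(a);     // load four doubles (256 bits)
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_add_pd(va, vb);  // compiles to a single VADDPD
    _mm256_storeu_pd(c, vc);             // store the four results
}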
Graphics Processing Units (GPU)
• Hardware in Graphics Units similar to Vector Processors
– Works well with data-level parallel problems
– Scatter-gather transfers
– Mask registers
– Large register files
• Differences:
– No scalar processor
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply
pipelined units like a vector processor
Graphics Processing Units (II)
• Using NVIDIA GPUs as an example
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Programming model is “Single Instruction Multiple
Thread”
• GPU hardware handles thread management, not
applications or OS
Example: Vector Addition
• Sequential code:
#define N 1024  // vector length (value assumed)

int main ( int argc, char **argv )
{
  int i, A[N], B[N], C[N];
  for ( i = 0; i < N; i++ ) {
    C[i] = A[i] + B[i];
  }
  return (0);
}
CUDA: replace the loop by N threads,
each executing one element of the vector
add operation
Example: Vector Addition (II)
• CUDA: replace the loop by N threads, each executing one
element of the vector add operation
• Question: How does each thread know which elements
to execute?
– threadIdx : each thread has an id which is unique within
the thread block
• of type uint3, essentially a
struct {
unsigned int x, y, z;
} uint3;
– blockDim: number of threads in each dimension of the
thread block (of type dim3, which has the same x, y, z layout)
• a thread block can be 1D, 2D or 3D
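For a 2-D thread block, both indices are combined; a sketch (the
kernel and its parameters are hypothetical):

// each thread computes its global (x,y) position in a 2-D grid
__global__ void init2D(float *d_M, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    d_M[y * width + x] = 0.0f;  // e.g., initialize one matrix element
}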
Example: Vector Addition (III)
• Initial CUDA kernel:
• This code is limited by the maximum number of threads in a thread
block
– compute capability 1.3: max. 512 threads per block
– if vector is longer, we have to create multiple thread blocks
void vecadd ( int *d_A, int *d_B, int* d_C)
{
int i = threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
Assuming a 1-D thread block
-> only x-dimension used
How does the compiler know which code to
compile for the CPU and which for the GPU?
• A specifier tells the compiler where a function will be executed
-> the compiler can generate code for the corresponding processor
• Executed on CPU, called from CPU (default if not specified)
__host__ void func(…)
• CUDA kernel to be executed on GPU, called from CPU
__global__ void func(...);
• Function executed on GPU, called from GPU (a device function)
__device__ void func(...);
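A minimal sketch combining all three specifiers (the function
names are hypothetical):

__device__ float square(float x)      // GPU code, callable only from GPU
{
    return x * x;
}

__global__ void squareAll(float *d_x) // GPU kernel, launched from CPU
{
    int i = threadIdx.x;
    d_x[i] = square(d_x[i]);
}

__host__ int main(void)               // CPU code (the default)
{
    float *d_x;
    cudaMalloc((void**)&d_x, 32 * sizeof(float));
    squareAll<<<1, 32>>>(d_x);        // one block of 32 threads
    cudaFree(d_x);
    return 0;
}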
Example: Vector Addition (IV)
• so the CUDA kernel is in reality:
• Note:
– d_A, d_B, and d_C are in global memory
– int i is in local memory of the thread
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
int i = threadIdx.x;
d_C[i] = d_A[i] + d_B[i];
return;
}
If you have multiple thread blocks:

__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
  // blockIdx.x: ID of the thread block that this thread is part of
  // blockDim.x: number of threads in a thread block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  d_C[i] = d_A[i] + d_B[i];
  return;
}
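If N is not a multiple of the block size, the last block contains
threads whose index falls beyond the end of the vectors; a common
guard (a sketch — the length parameter n is an addition to the
slide's kernel):

__global__ void vecAdd ( int *d_A, int *d_B, int* d_C, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if ( i < n )                  // threads past the end do nothing
    d_C[i] = d_A[i] + d_B[i];
  return;
}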
Using more than one element per thread

// NUMELEMENTS: number of consecutive vector elements handled by
// each thread (assumed to be a compile-time constant)
__global__ void vecAdd ( int *d_A, int *d_B, int* d_C)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j;
  for ( j = i*NUMELEMENTS; j < (i+1)*NUMELEMENTS; j++ )
    d_C[j] = d_A[j] + d_B[j];
  return;
}
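An alternative layout (a sketch, not from the slides) is a
grid-stride loop: each thread starts at its global index and jumps
by the total number of threads, so consecutive threads touch
consecutive elements in every iteration:

__global__ void vecAdd ( int *d_A, int *d_B, int* d_C, int n)
{
  int j;
  for ( j = blockIdx.x * blockDim.x + threadIdx.x; j < n;
        j += gridDim.x * blockDim.x )  // stride = total thread count
    d_C[j] = d_A[j] + d_B[j];
  return;
}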
NVIDIA Instruction Set Architecture
• Parallel Thread Execution (PTX)
– is an abstraction of the hardware instruction set
– Uses virtual registers
– Translation to machine code is performed in software
– Example for one iteration of a loop executing
y[i] = a*x[i] + y[i]
with a block size of 512 threads per block:

shl.s32 R8, blockIdx, 9       ; Thread Block ID * Block size (512 = 2^9)
add.s32 R8, R8, threadIdx     ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4         ; Product in RD0 = RD0 * a (RD4 holds a)
add.f64 RD0, RD0, RD2         ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0     ; Y[i] = sum (X[i]*a + Y[i])
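The CUDA source compiling to roughly this PTX is one iteration of
the DAXPY loop; a sketch (one element per thread, 512-thread
blocks assumed):

__global__ void daxpy(double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = a * x[i] + y[i];   // one loop iteration per CUDA thread
}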
Conditional Branching
• Branch hardware uses internal masks
– Branch synchronization stack to support nested branch
instructions
• Entries consist of masks for each SIMD lane (CUDA thread)
– Instruction markers to manage when a branch diverges into
multiple execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• For equal-length IF-ELSE constructs, the code operates at 50%
efficiency
– either the IF part or the ELSE part is idle in each lane at
any given time
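A sketch of such an IF-ELSE construct (hypothetical kernel): the
lanes whose condition is false are masked off while the IF part
executes, and vice versa, so each path runs at 50% lane
utilization:

__global__ void safeDivide(double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( y[i] != 0.0 )
        x[i] = x[i] / y[i];   // lanes with y[i] == 0 sit idle here
    else
        x[i] = 0.0;           // now the other lanes sit idle
}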
Nvidia GT200
• The GT200 is a multi-core chip with a two-level hierarchy
– focused on high throughput for data-parallel workloads
• 1st level of hierarchy: 10 Thread Processing Clusters (TPC)
• 2nd level of hierarchy: each TPC has
– 3 Streaming Multiprocessors (SM) ( an SM corresponds to 1
core in a conventional processor)
– a texture pipeline (used for memory access)
• Global Block Scheduler:
– issues thread blocks to SMs with available capacity
– simple round-robin algorithm but taking resource
availability (e.g. of shared memory) into account
Nvidia GT200
Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”,
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1
Streaming multi-processor (I)
• Instruction fetch, decode and issue logic
• 8 32-bit ALU units (often referred to as streaming
processors (SPs), or confusingly called 'cores' by Nvidia)
• 8 branch units
– a thread encountering a branch will stall until it is
resolved (no speculation), branch delay: 4 cycles
• Two 64-bit special units for less frequent operations
– 64-bit operations are 8-12 times slower than 32-bit operations!
• 1 special function unit for ‘unusual’ instructions
– transcendental functions, interpolations, reciprocal
square roots
– take anywhere from 16 to 32 cycles to execute
Streaming multi-processor (II)
• Single issue with SIMD capabilities
• Can execute up to 8 thread blocks/1024 threads
concurrently
• Does not support speculative execution or branch prediction
• Instructions are scoreboarded to reduce stalls
• Each SP has access to 2048 register file entries each with 32
bits
– a double precision number has to utilize two adjacent
registers
– register file can be used by up to 128 threads
concurrently
Streaming multi-processor (III)
Image Source: David Kanter, “Nvidia GT200: Inside a Parallel Processor”,
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242&p=1
Streaming multi-processor (IV)
• The execution units of an SM run at twice the frequency of the
fetch and issue logic, as well as of the memory and register file
• 64KB register file that is partitioned across all SPs
• 16KB shared memory that can be used for communication
between the threads running on the SPs of the same SM
– organized in 4096 entries across 16 banks (= 32-bit bank width)
– accessing shared memory is as fast as accessing a
register!
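A sketch of intra-block communication through shared memory
(hypothetical kernel, assuming a 256-thread block):

__global__ void reverseBlock(int *d)
{
    __shared__ int s[256];  // lives in the SM's shared memory
    int t = threadIdx.x;
    s[t] = d[t];            // each thread writes one element
    __syncthreads();        // wait for all writes in the block
    d[t] = s[255 - t];      // read an element another thread wrote
}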
Load/Store operations

• Generated in the SMs, but handled by the SM controller in the TPC
– the load pipeline shares hardware with the texture pipeline
– shared by the three SMs
– mutually exclusive usage of load and texture pipelines
– effective address calculation + mapping of 40-bit virtual
addresses to physical addresses by the MMU
• Texture cache:
– 2-D addressing
– read only caches without cache coherence
• entire cache hierarchy invalidated if a data item is
modified
– texture caches are used to save bandwidth and power; they are
not really faster than accessing texture memory directly
CUDA Memory Model
CUDA Memory Model (II)
• cudaError_t cudaMalloc(void** devPtr, size_t size)
– Allocates size bytes of device(global) memory pointed to by *devPtr
– Returns cudaSuccess for no error
• cudaError_t cudaMemcpy(void* dst, const void* src,
size_t count, enum cudaMemcpyKind kind)
– dst = destination memory address
– src = source memory address
– count = bytes to copy
– kind = type of transfer (cudaMemcpyHostToDevice,
cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice)
• cudaError_t cudaFree(void* devPtr)
– Frees memory allocated with cudaMalloc
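Since each call returns a cudaError_t, a minimal error-checking
pattern looks like this (a sketch, not from the slides;
cudaGetErrorString is the standard helper):

cudaError_t err = cudaMalloc( (void**)&d_a, N*sizeof(float) );
if ( err != cudaSuccess )   // requires <stdio.h> for fprintf
    fprintf( stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err) );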
Slide based on a lecture by Matt Heavener, CS, State Univ. of NY at Buffalo
http://www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf
Example: Vector Addition (V)

int main ( int argc, char ** argv) {
  float a[N], b[N], c[N];
  float *d_a, *d_b, *d_c;
  cudaMalloc( (void**)&d_a, N*sizeof(float));
  cudaMalloc( (void**)&d_b, N*sizeof(float));
  cudaMalloc( (void**)&d_c, N*sizeof(float));
  cudaMemcpy( d_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy( d_b, b, N*sizeof(float), cudaMemcpyHostToDevice);
  dim3 threadsPerBlock(256); // 1-D array of threads
  dim3 blocksPerGrid(N/256); // 1-D grid
  vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
  cudaMemcpy( c, d_c, N*sizeof(float), cudaMemcpyDeviceToHost); // dst = host array
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  return 0;
}
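Note that blocksPerGrid(N/256) assumes N is a multiple of 256;
otherwise the grid size must be rounded up (a sketch) and the
kernel needs an i < N bounds check as shown earlier:

dim3 blocksPerGrid( (N + 255) / 256 );  // round up so all N elements are covered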
Nvidia Fermi processor

• Next generation of Nvidia processors
• Removed one level of hierarchy
– contains 16 SM processors, but no notion of TPCs anymore
• Each SM processor has
– 32 ALU units (Nvidia ‘cores’, SIMD ‘lanes’ in the book)
• compared to 8 on the GT200
• organized as two units with 16 ALUs each
– 16 load/store units
• compared to 1 for three SMs in GT200
– 64 KB local SRAM that can be split into L1 cache and
shared memory (16 KB/48 KB or 48 KB/16 KB)
– 4 special function units
• compared to 1 in GT200
Nvidia Fermi SM processor
Image Source:Peter N. Glaskowsky, “Nvidia’s Fermi: The First Complete GPU Architecture”
http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf
Nvidia Fermi processor
• Can manage up to 1,536 threads simultaneously per SM
– compared to 1024 per SM on the GT200
• Register file increased to 128 KB (32K entries)
• New: modified address space using 40-bit addresses
– global, shared and local addresses are ranges within that
address space
• New: support for atomic read-modify-write operation
• New: support for predicated instructions
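For example, the atomic read-modify-write support lets many
threads update the same location safely; a sketch (hypothetical
histogram kernel, atomicAdd is the standard CUDA intrinsic):

__global__ void histogram(int *bins, int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        atomicAdd( &bins[data[i]], 1 );  // one atomic read-modify-write
}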
Similarities and Differences between GPU
and Vector Processors
• Memory organization and management
– All GPU memory accesses are gather-scatter
-> special hardware to recognize address coalescing
-> hides memory latency due to large number of threads
and scoreboarding
– Loading data into a vector register is contiguous by default
-> special support for gather-scatter operation
-> costs of load/store operation amortized due to large
number of elements accessed at once
Similarities and Differences between GPU
and Vector Processors (II)
• Processor organization and ISA
– A vector register holds an entire vector <-> on a GPU the
vector is distributed across registers in different ALUs
– Much higher number of ALUs/threads supported in a GPU
than the number of lanes in a vector processor
– A PTX instruction is similar to a vector instruction
– Both approaches use mask registers to handle conditional
instructions
-> mask set by compiler for vector processors
-> mask set at runtime by hardware for GPU
Similarities and Differences between GPU
and Vector Processors (III)
• In a vector processor, the scalar processor executes the
scalar operations
• A GPU could use the regular CPU for scalar operations
– but data transfer between GPU and CPU memory is
expensive
– so scalar code is often executed on the GPU instead