GPU-based Computing


Tesla C870 GPU

8 KB constant cache / multiprocessor

8 KB texture cache / multiprocessor

1.5 GB global memory per GPU

16 KB shared memory / multiprocessor

8192 registers / multiprocessor

up to 768 threads per multiprocessor (at 21 bytes of shared memory and 10 registers per thread)

Tesla C870 Threads

Example

B blocks of T threads

Which element am I processing?

Case Study: Irregular Terrain Model

• each profile is 1 KB

• 16*16 threads

• 128*16 threads, 45 regs

• 192*16 threads, 37 regs

GPU Strategies for ITM


OpenMP vs CUDA

#pragma omp parallel for shared(A) private(i, j, value)
for (i = 0; i < 32; i++)
    for (j = 0; j < 32; j++)
        value = some_function(A[i][j]);

Easy to Learn

Takes time to master

For further discussion and additional information

Tesla C870 GPU vs. CELL BE

[Chart: ITM execution time for 256K profiles — GPU 1.35 GHz: 0.138 s; CELL BE 2.8 GHz: 2.067 s; GPP 2.4 GHz: 17.729 s]

3072 threads on GPU

15x faster than CELL BE, 128x faster than GPP

Tesla C1060: 30 cores, double the registers ($1,500), with 8x the GFLOPS of an Intel Xeon W5590 quad core ($1,600)

Fermi: Next Generation GPU

• 32 cores, 1536 threads per core

• No complex mechanisms

– speculation, out-of-order execution, superscalar, instruction-level parallelism, branch predictors

• Column access to memory – faster search, read

• First GPU with error-correcting code – registers, shared memory, caches, DRAM

• Languages supported – C, C++, FORTRAN, Java, Matlab, and Python

• IEEE 754-2008 double-precision floating point – fused multiply-add operations

• Streaming and task switching (25 microseconds) – launching multiple kernels (up to 16) simultaneously

• Visualization and computing

Sequence Alignment

Hi,j = max { Hi-1,j-1 + Si,j, Hi-1,j – G, Hi,j-1 – G, 0 }  // G = 10

A B C D F W

A 5 -2 -1 -2 -3 -3

B -2 5 -3 -3 -4 -5

C -1 -3 13 -4 -2 -5

D -2 -3 -4 8 -5 -5

F -3 -4 -2 -5 8 4

W -3 -5 -5 -5 4 15


Substitution Matrix

Cost Function

A B C D … Z

A 5 -2 -1 -2 … -3

B -2 5 -3 -3 … -5

C -1 -3 13 -4 … -5

D -2 -3 -4 8 … -5

… … … … … … …

Z -3 -5 -5 -5 … 15

Only need space for 26x26 matrix

New Cost Function

A R N D … X

Q -1 1 0 0 … -1

U -1 -1 -1 -1 … -1

E -1 0 0 2 … -1

R -1 5 0 -2 … -1

Y -2 -2 -2 -3 … -1

… … … … … … …

Previous Methods

Space needed is 23x(Query Length)

Sorted Substitution Table

Computed a new table from the substitution matrix, with substitution characters for the top row and the query sequence for the column. *Does not use modulo

Protein Length   GPU (1.35 GHz) Time (s)   SSEARCH (3.2 GHz) Time (s)   Speedup   GPU Cycles (Billions)   SSEARCH Cycles (Billions)   Cycles Ratio
4                1.3                       10                           8.0       1.69                    32.00                       18.96
8                1.8                       12                           6.8       2.38                    38.40                       16.11
16               2.8                       26                           9.2       3.83                    83.20                       21.71
32               5.8                       47                           8.1       7.79                    150.40                      19.30
64               11.2                      99                           8.8       15.17                   316.80                      20.89
128              22.0                      212                          9.7       29.64                   678.40                      22.88
256              43.8                      428                          9.8       59.14                   1369.60                     23.16
512              92.4                      886                          9.6       124.78                  2835.20                     22.72
768              144.6                     1292                         8.9       195.15                  4134.40                     21.19
1024             279.7                     1807                         6.5       377.55                  5782.40                     15.32

Alignment Database: Swissprot (Aug 2008), containing 392,768 sequences. GSW vs SSEARCH.

Results

From Software Perspective

• A kernel is a SPMD-style computation

• Programmer organizes threads into thread blocks

• Kernel consists of a grid of one or more blocks

• A thread block is a group of concurrent threads that can cooperate amongst themselves through

– barrier synchronization

– a per-block private shared memory space

• Programmer specifies

– number of blocks

– number of threads per block

• Each thread block = virtual multiprocessor

– each thread has a fixed register allocation

– each block has a fixed allocation of per-block shared memory

Efficiency Considerations

• Avoid execution divergence

– occurs when threads within a warp follow different execution paths

– divergence between warps is OK

• Allow loading a block of data into the SM

– process it there, then write the final result back out to external memory

• Coalesce memory accesses

– access consecutive words instead of gather-scatter

• Create enough parallel work– 5K to 10K threads

Most important factor

• number of thread blocks = “processor count”

• At thread block (or SM) level– Internal communication is cheap

• Focus on decomposing the work among the “p” thread blocks