GPU multiprocessing
Manuel Ujaldón Martínez
Computer Architecture Department
University of Malaga (Spain)
Outline
1. Multichip solutions
2. Multicard solutions
3. Multichip + multicard
4. Performance on matrix decompositions
5. CUDA programming
6. Scalability on 3DFD
A world of possibilities
From lower to higher cost, we have:
1. Multichip: Voodoo5 (3Dfx), 3D1 (Gigabyte).
2. Multicard: SLI (Nvidia) / CrossFire (ATI).
3. Combination: two chips per card and/or two cards per connector.
[Figure: multi-GPU products from Gigabyte (2005), Evans & Sutherland (2004), ATI (2007) and NVIDIA (2007, 2008)]
I. Multichip solutions
First choice: Multichip. A retrospective:
- Voodoo 5 5500, 3Dfx (1999).
- Rage Fury Maxx, ATI (2000).
- Volari V8 Duo, XGI (2002).
- Dual Radeon 9800 (prototype), Sapphire (2003).
First choice: Multichip. Example 1: 3D1 (Gigabyte - 2005).
A dual GeForce 6600 GT: two GPUs on the same card (December 2005).
Each GPU is endowed with 128 MB of memory and a 128-bit bus.
First choice: Multichip. Example 2: GeForce 7950 GX2 (Nvidia - 2006).
First choice: Multichip. Example 3: GeForce 9800 GX2 (Nvidia - 2008).
Two GeForce 8800 GPUs, two printed circuit boards and two 512 MB video memories, behind a single PCI-express connector.
First choice: Multichip. 3D1 (Gigabyte). Cost and performance.
Card                     3DMark 2003           3DMark 2005
                         1024x768   1600x1200  1024x768   1600x1200
GeForce 6600 GT              8234        2059      3534        2503
3D1 using a single GPU       8529        2063      3572        2262
GeForce 6800 GT             11493        3846      4858        3956
GeForce 6600 GT SLI         14049        3924      6122        3542
3D1 using two GPUs          14482        4353      6307        3609
Cost: row 3 > row 4 > row 5 > row 2 > row 1.
First choice: Multichip. 3D1 (Gigabyte). Analysis.
As compared to a single GeForce 6800 GT, the 3D1 has:
- Lower cost.
- Higher arithmetic performance: better at lower resolutions and with software innovations (shaders).
- Similar bandwidth.
- Lower memory space and usability: vertices and textures must be replicated, and a GPU cannot see the memory of its twin.
As compared to two GeForce 6600 GT cards connected through SLI:
- Slightly lower cost.
- Greater performance without demanding CPU bandwidth.
- Less versatile: SLI allows future expansion and/or single-card use.
First choice: Multichip. GeForce 7950 GX2 (2006)
GPU developed by Nvidia in June 2006. The chip has a twin soul: duality affects the whole design.
Clocks are slower than in the single-GPU model:
- GPU: 500 MHz (twin) versus 650 MHz (stand-alone).
- Memory: 2x600 MHz (twin) versus 2x800 MHz (stand-alone).
Drivers were released almost a year later, which initially penalized the popularity of this card.
It provides 48 pixel processors (24 on each GPU) and 1 GB of video memory (512 MB attached to each GPU through its own 256-bit bus).
First choice: Multichip (2006). Transistors.
A smaller chip built from smaller transistors allows the design to grow through GPU replication.
First choice: Multichip (2006). Frequency.
A double GPU allows relaxing clock rates, which reduces heat and power consumption.
First choice: Multichip (2006). Bandwidth.
Two GPUs placed on parallel planes make it easier to double the aggregate bus width to 512 bits.
II. Multicard solutions
Second choice: Multicard. A couple of GPUs.
SLI (Nvidia, on GeForces) / CrossFire (ATI, on Radeons).
Second choice: Multicard. SLI (Nvidia). Elements.
- The motherboard must provide several PCI-express 2.0 x16 slots.
- The power supply must deliver at least 700 watts.
- Performance issues: a twin card may increase performance by 60-80%, while a new generation of GPUs may increase it even more. The time frame becomes crucial!
III. Multichip + multicard
1+2 choice: Multichip+multicard
The first solution available on the marketplace: Gigabyte (2005), based on GeForce 6 GPUs. It allows heterogeneous graphics cards, but workload balancing gets complicated.
1+2 choice: Multichip+multicard. Implementation details.
[Figure: implementations with 2, 4 and 8 GPUs]
1+2 choice: Multichip+multicard. Newer designs.
Combining a number of GeForce 9800 GX2 cards on a multi-slot motherboard configures up to quad-SLI: 2 GPUs/card x up to 4 cards = 8 GPUs.
IV. Performance on matrix decompositions
Multicard performance versus a newer generation (LU decomposition)
A second (twin) GPU improves performance 1.6x, but does not reach the performance of a single card from the next generation.
CPU+GPU performance versus a single quad-core CPU (more on this later)
The benchmark is composed of three popular matrix decompositions used in linear algebra.
V. CUDA programming for multi-GPU applications
Device Management
The CPU can query and select GPU devices:
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *current_device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
- cudaChooseDevice(int *device, cudaDeviceProp *prop)
Multi-GPU setup:
- Device 0 is used by default.
- One CPU thread can control only one GPU.
- Multiple CPU threads can control the same GPU; calls are serialized by the driver.
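A minimal sketch of the query-and-select pattern above (error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);          // how many CUDA devices are visible
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d multiprocessors\n",
               d, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(count > 1 ? 1 : 0);    // pick a non-default GPU when available
    int current = -1;
    cudaGetDevice(&current);             // confirm the selection
    printf("Using device %d\n", current);
    return 0;
}
```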
Multiple CPU Threads and CUDA
CUDA resources allocated by a CPU thread can be consumed only by CUDA calls from the same CPU thread.
Violation example:
- CPU thread 2 allocates GPU memory and stores the address in p.
- CPU thread 3 issues a CUDA call that accesses memory via p.
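The correct pattern under this constraint is one CPU thread per GPU, each allocating, using and releasing its own resources. A sketch using pthreads (thread count and buffer size are illustrative):

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

void *worker(void *arg) {
    int device = *(int *)arg;
    cudaSetDevice(device);           // bind this CPU thread to one GPU
    float *d_buf;
    cudaMalloc(&d_buf, 1 << 20);     // allocated by this thread...
    // ... launch kernels and issue copies from this same thread only
    cudaFree(d_buf);                 // ...and released by this thread
    return NULL;
}

int main() {
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```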
42
When using several GPUs, the implementation gets complicated.
GPUs don't share video memory, so the programmer must move data around through PCI-express (even when the GPUs belong to the same graphics card, as in the GeForce 9800 GX2).
Steps to follow:
- Copy data from GPU A to CPU thread A.
- Copy data from CPU thread A to CPU thread B using MPI.
- Copy data from CPU thread B to GPU B.
We can use asynchronous copies to overlap kernel execution on the GPU with data copies, and pinned memory to share copies among CPU threads (use cudaHostAlloc()).
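The three copy steps can be sketched as follows; the function and buffer names are hypothetical, and error checking is omitted:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Moves 'bytes' from GPU A (owned by MPI rank rankA) to GPU B (owned by
// rank rankB). h_buf should be pinned (cudaHostAlloc) for full PCI-e speed.
void gpu_to_gpu(int rank, int rankA, int rankB,
                float *d_bufA, float *d_bufB, float *h_buf, size_t bytes) {
    if (rank == rankA) {
        cudaMemcpy(h_buf, d_bufA, bytes, cudaMemcpyDeviceToHost);  // step 1
        MPI_Send(h_buf, (int)bytes, MPI_BYTE, rankB, 0,
                 MPI_COMM_WORLD);                                  // step 2
    } else if (rank == rankB) {
        MPI_Recv(h_buf, (int)bytes, MPI_BYTE, rankA, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_bufB, h_buf, bytes, cudaMemcpyHostToDevice);  // step 3
    }
}
```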
Host Synchronization
All kernel launches are asynchronous:
- Control returns to the CPU immediately.
- The kernel executes after all previous CUDA calls have completed.
cudaMemcpy() is synchronous:
- Control returns to the CPU after the copy completes.
- The copy starts after all previous CUDA calls have completed.
cudaThreadSynchronize() blocks until all previous CUDA calls complete.
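These semantics can be illustrated with a short fragment (kernel and buffer names are hypothetical):

```cuda
myKernel<<<grid, block>>>(d_data);  // asynchronous: returns immediately
// ... the CPU can do independent work here while the kernel runs ...
cudaThreadSynchronize();            // blocks until all previous CUDA calls end
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
// synchronous: the copy starts only after all previous CUDA calls have
// completed, and control returns to the CPU only when the copy is done
```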
CPU-GPU interactions: Conclusions
CPU-GPU memory bandwidth is much lower than GPU memory bandwidth. Use page-locked host memory (cudaMallocHost()) for maximum CPU-GPU bandwidth:
- 3.2 GB/s is common on PCI-e x16; ~4 GB/s was measured on nForce 680i chipsets (8 GB/s for PCI-e 2.0).
- Be cautious, however, since allocating too much page-locked memory can reduce overall system performance.
Minimize CPU-GPU data transfers by moving more code from the CPU to the GPU, even if that means running kernels with low parallelism. Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to CPU memory.
Group data transfers: one large transfer is much better than many small ones.
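A sketch of both recommendations (the size N and the pointer names are illustrative):

```cuda
// Page-locked (pinned) host memory maximizes PCI-express bandwidth.
float *h_data;
cudaMallocHost((void **)&h_data, N * sizeof(float));  // instead of malloc()

// One large, grouped transfer...
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
// ...is much better than N small ones:
// for (int i = 0; i < N; ++i)
//     cudaMemcpy(&d_data[i], &h_data[i], sizeof(float),
//                cudaMemcpyHostToDevice);

cudaFreeHost(h_data);  // pinned memory has its own free call
```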
VI. Scalability for 3DFD (Nvidia code)
Example: Multi-GPU implementation for 3DFD
3DFD is a finite-differences code for the discretization of the seismic wave equation: 8th order in space, 2nd order in time, using a regular mesh.
Fixed X and Y dimensions, varying Z. Data is partitioned among GPUs along the Z axis.
Computation increases with Z, while communication (per node) stays constant.
Each GPU has to exchange 4 xy-planes (ghost nodes) with each of its neighbors.
Executed on a cluster with 2 GPUs per node and an Infiniband SDR network.
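The partitioning above suggests a per-timestep structure like the following sketch; kernel, stream, and buffer names are all hypothetical, only one Z-neighbor is shown, and error handling is omitted:

```cuda
for (int step = 0; step < nsteps; ++step) {
    // Compute the 4 boundary xy-planes needed by the neighbor first,
    // then overlap the interior computation with the halo exchange.
    stencil_kernel<<<gridBoundary, block, 0, haloStream>>>(d_u, BOUNDARY);
    stencil_kernel<<<gridInterior, block, 0, mainStream>>>(d_u, INTERIOR);

    cudaMemcpyAsync(h_send, d_boundary, haloBytes,
                    cudaMemcpyDeviceToHost, haloStream);
    cudaStreamSynchronize(haloStream);

    // Swap the 4 ghost planes with the neighbor along Z through MPI.
    MPI_Sendrecv(h_send, haloCount, MPI_FLOAT, neighbor, 0,
                 h_recv, haloCount, MPI_FLOAT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaMemcpyAsync(d_ghost, h_recv, haloBytes,
                    cudaMemcpyHostToDevice, haloStream);
    cudaThreadSynchronize();  // all zones in place before the next step
}
```

Linear scaling then depends on the interior computation taking at least as long as the exchange it hides, which matches the measurements on the following slides.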
Performance for a couple of GPUs
Linear scaling is achieved when computation time exceeds communication time.
Three or more cluster nodes
Times are per cluster node. At least one cluster node needs two MPI communications, one with each of its neighbors.
Performance with 8 GPUs
The 8x improvement factor is sustained for Z > 1300, exactly where computation exceeds communication.