CUDA Image Registration 29 Oct 2008 Richard Ansorge
Medical Image Registration: A Quick Win
Richard Ansorge
The problem
• CT, MRI, PET and Ultrasound produce 3D volume images.
• Typically 256 x 256 x 256 = 16,777,216 image voxels.
• Combining modalities (inter-modality) gives extra information.
• Repeated imaging over time with the same modality, e.g. MRI (intra-modality), is equally important.
• The images have to be spatially registered.
Example – brain lesion
CT MRI PET
PET-MR Fusion
The PET image shows metabolic activity.
This complements the MR structural information.
Registration Algorithm
(Flowchart) Transform Im B to match Im A, giving Im B′. Compute the cost function comparing Im A with Im B′. Good fit? If yes: done. If no: update the transform parameters and repeat.
NB Cost function calculation dominates for 3D images and is inherently parallel
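On the CPU, the simplest such cost function is just a sum of absolute differences over every voxel; a minimal reference sketch follows (the function name `sad_cost` and the flat-vector layout are illustrative, and the real code samples the transformed image via texture interpolation rather than comparing two co-registered grids):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference: sum of absolute differences over an nx*ny*nz volume.
// Illustrative sketch only; the GPU version compares the source against
// a trilinearly interpolated, affinely transformed target instead.
float sad_cost(const std::vector<float>& a, const std::vector<float>& b,
               int nx, int ny, int nz)
{
    float cost = 0.0f;
    const std::size_t n = (std::size_t)nx * ny * nz;
    for (std::size_t i = 0; i < n; ++i)
        cost += std::fabs(a[i] - b[i]);
    return cost;
}
```

Every voxel contributes independently to the sum, which is why the calculation maps so naturally onto one GPU thread per image column.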
Transformations
General affine transform has 12 parameters:

\[
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
Polynomial transformations can be useful for e.g. pin-cushion type distortions:
\[
\begin{aligned}
x' &= a_{11}x + a_{12}y + a_{13}z + a_{14} + b_1 x^2 + b_2 xy + b_3 y^2 + b_4 z^2 + b_5 xz + b_6 yz \\
y' &= \ldots \\
z' &= \ldots
\end{aligned}
\]
Local, non-linear transformations, e.g using cubic BSplines, increasingly popular, very computationally demanding.
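Applying the 12-parameter affine transform to a voxel amounts to three dot products of the matrix rows with the homogeneous position (x, y, z, 1), which is exactly how the kernel shown later computes tx, ty, tz; a minimal CPU sketch (the name `affine_apply` and the row-major layout are assumptions for illustration):

```cpp
#include <array>

// Apply a 4x4 affine matrix (row-major; last row implicitly 0 0 0 1)
// to a homogeneous voxel position (x, y, z, 1).
std::array<float, 3> affine_apply(const std::array<float, 16>& a,
                                  float x, float y, float z)
{
    return {
        a[0]*x + a[1]*y + a[ 2]*z + a[ 3],   // tx = dot(row0, v)
        a[4]*x + a[5]*y + a[ 6]*z + a[ 7],   // ty = dot(row1, v)
        a[8]*x + a[9]*y + a[10]*z + a[11],   // tz = dot(row2, v)
    };
}
```

With the identity matrix in the upper-left 3x3 block, the fourth column acts as a pure translation.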
We tried this before: 6-parameter rigid registration, done 8 years ago
(Chart: registration time in seconds, and speedup factor, versus number of processors from 0 to 64; series: SR2201, PC 333MHz, Speedup, perfect scaling.)
Now - Desktop PC - Windows XP
Needs 400 W power supply
Free Software: CUDA & Visual C++ Express
Visual C++ SDK in action
Architecture
9600 GT Device Query
Current GTX 280 has 240 cores!
Matrix Multiply from SDK
NB using 4-byte floats
Matrix Multiply (from SDK)
GPU v CPU for NxN Matrix Multiply
(Chart: GPU speedup versus N, N = 0 to 6144; series: average speedup.)
Matrix Multiply (from SDK)
GPU v CPU for NxN Matrix Multiply
(Chart: GPU speedup versus N, N = 0 to 6144; series: speedup, average speedup.)
Matrix Multiply (from SDK)
GPU v CPU for NxN Matrix Multiply
(Chart: GPU speedup and raw throughput versus N, N = 0 to 6144; series: speedup, CPU mads/100 ns, GPU mads/ns.)
Image Registration
CUDA Code
#include <cutil_math.h>

texture<float, 3, cudaReadModeElementType> tex1; // Target Image in texture
__constant__ float c_aff[16];                    // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx, int ny, int nz, float *b, float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f;                               // start with slice zero
    float4 v  = make_float4(x, y, z, 1.0f);
    float4 r0 = make_float4(c_aff[ 0], c_aff[ 1], c_aff[ 2], c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4], c_aff[ 5], c_aff[ 6], c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8], c_aff[ 9], c_aff[10], c_aff[11]);
    float4 r3 = make_float4(c_aff[12], c_aff[13], c_aff[14], c_aff[15]); // 0,0,0,1?
    float tx = dot(r0, v);                        // Matrix Multiply using dot products
    float ty = dot(r1, v);
    float tz = dot(r2, v);
    float source = 0.0f;
    float target = 0.0f;
    float cost   = 0.0f;
    uint is    = iy*nx + ix;
    uint istep = nx*ny;
    for (int iz = 0; iz < nz; iz++) {             // process all z's in same thread here
        source = s[is];
        target = tex3D(tex1, tx, ty, tz);
        is  += istep;
        v.z += 1.0f;
        tx = dot(r0, v);
        ty = dot(r1, v);
        tz = dot(r2, v);
        cost += fabs(source - target);            // other costfuns here as required
    }
    b[iy*nx + ix] = cost;                         // store thread sum for host
}
texture<float, 3, cudaReadModeElementType> tex1;
__constant__ float c_aff[16];
tex1: moving image, stored as a 3D texture
c_aff: affine transformation matrix, stored in constant memory
// device function declaration
__global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)

nx, ny & nz: image dimensions (assumed the same for both images)
b: output array for partial sums
s: reference image (mislabelled in the code comments)
int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
float x = (float)ix;
float y = (float)iy;
float z = 0.0f; // start with slice zero
Which thread am I? (similar to an MPI rank), but here there is one thread for each x-y pixel: 240 x 256 = 61,440 threads (cf. ~128 nodes for MPI).
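The block/thread index arithmetic can be checked on the host by emulating the launch; a small sketch (the function name `count_coverage` is illustrative) counts, for 16x16 blocks, how many pixels of an nx x ny slice receive exactly one thread:

```cpp
#include <vector>

// Emulate CUDA's 2D thread indexing on the CPU: count how many times each
// pixel index is produced. Block size 16x16, as in the host code later on.
int count_coverage(int nx, int ny)
{
    const int bs = 16;
    const int gx = (nx + bs - 1) / bs;   // gridSize.x, rounded up
    const int gy = (ny + bs - 1) / bs;   // gridSize.y, rounded up
    std::vector<int> hits(nx * ny, 0);
    for (int by = 0; by < gy; ++by)
        for (int bx = 0; bx < gx; ++bx)
            for (int ty = 0; ty < bs; ++ty)
                for (int tx = 0; tx < bs; ++tx) {
                    int ix = bx * bs + tx;   // blockIdx.x*blockDim.x + threadIdx.x
                    int iy = by * bs + ty;   // blockIdx.y*blockDim.y + threadIdx.y
                    if (ix < nx && iy < ny) ++hits[iy * nx + ix];
                }
    int covered_once = 0;
    for (int h : hits) covered_once += (h == 1);
    return covered_once;   // pixels that got exactly one thread
}
```

For the 240x256 slices here, every pixel is produced exactly once, giving the 61,440 threads quoted above.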
float4 v  = make_float4(x, y, z, 1.0f);
float4 r0 = make_float4(c_aff[ 0], c_aff[ 1], c_aff[ 2], c_aff[ 3]);
float4 r1 = make_float4(c_aff[ 4], c_aff[ 5], c_aff[ 6], c_aff[ 7]);
float4 r2 = make_float4(c_aff[ 8], c_aff[ 9], c_aff[10], c_aff[11]);
float4 r3 = make_float4(c_aff[12], c_aff[13], c_aff[14], c_aff[15]); // 0,0,0,1?
float tx = dot(r0, v); // Matrix Multiply using dot products
float ty = dot(r1, v);
float tz = dot(r2, v);
float source = 0.0f;
float target = 0.0f;
float cost = 0.0f;   // accumulates cost function contributions
v.z = 0.0f;          // z of first slice is zero (redundant as done above)
uint is = iy*nx+ix;  // this is index of my voxel in first z-slice
uint istep = nx*ny;  // stride to index same voxel in subsequent slices
Initialisations and first matrix multiply. "v" is a 4-vector holding the current voxel x,y,z address; "tx,ty,tz" hold the corresponding transformed position.
for(int iz=0; iz<nz; iz++) { // process all z's in same thread here
    source = s[is];
    target = tex3D(tex1, tx, ty, tz); // NB very FAST trilinear interpolation!!
    is += istep;
    v.z += 1.0f; // step to next z slice
    tx = dot(r0, v);
    ty = dot(r1, v);
    tz = dot(r2, v);
    cost += fabs(source-target); // other costfuns here as required
}
b[iy*nx+ix] = cost; // store thread sum for host
Loop sums contributions for all z values at a fixed x-y position. Each thread updates a different element of the 2D results array b.
Host Code Initialization Fragment
...
blockSize.x = blockSize.y = 16;      // multiples of 16 a VERY good idea
gridSize.x  = (w2+15) / blockSize.x;
gridSize.y  = (h2+15) / blockSize.y;

// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff, w2*h2*sizeof(float));   // passed as "b" to kernel
bufflen = w2*h2;
Array1D<float> shbuff = Array1D<float>(bufflen);
shbuff.Zero();
hbuff = shbuff.v;

cudaMalloc((void**)&dnewbuff, w2*h2*d2*sizeof(float)); // passed as "s" to kernel
cudaMemcpy(dnewbuff, vol2, w2*h2*d2*sizeof(float), cudaMemcpyHostToDevice);

e = make_float3((float)w2/2.0f, (float)h2/2.0f, (float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f);             // translations
r = make_float3(0.0f);             // rotations
s = make_float3(1.0f, 1.0f, 1.0f); // scale factors
t = make_float3(0.0f);             // tans of shears
...
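The grid sizing (w2+15)/blockSize.x is a ceiling division, so image widths that are not multiples of 16 still get enough blocks to cover every pixel; a one-line sketch of the same arithmetic (the name `blocks_needed` is illustrative):

```cpp
// Ceiling division as used for gridSize: this many blocks of blockSize
// threads cover at least w pixels. Out-of-range threads in the last block
// do no useful work and would need masking in the kernel; with the image
// sizes here (240, 256) the division is exact and nothing is wasted.
int blocks_needed(int w, int blockSize)
{
    return (w + blockSize - 1) / blockSize;
}
```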
Calling the Kernel

double nr_costfun(Array1D<double> &a)
{
    static Array2D<float> affine = Array2D<float>(4,4); // a holds current transformation
    double sum = 0.0;
    make_affine_from_a(nr_fit, affine, a);              // convert to 4x4 matrix of floats
    cudaMemcpyToSymbol(c_aff, affine.v[0], 4*4*sizeof(float));       // load constant mem
    d_costfun<<<gridSize, blockSize>>>(w2, h2, d2, dbuff, dnewbuff); // run kernel
    CUT_CHECK_ERROR("kernel failed");                   // OK?
    cudaThreadSynchronize();                            // make sure all done
    // copy partial sums from device to host
    cudaMemcpy(hbuff, dbuff, bufflen*sizeof(float), cudaMemcpyDeviceToHost);
    for (int iy = 0; iy < h2; iy++)
        for (int ix = 0; ix < w2; ix++) sum += hbuff[iy*w2 + ix];    // final sum
    calls++;
    if (verbose > 1) {
        printf("call %d costfun %12.0f, a:", calls, sum);
        for (int i = 0; i < a.sizex(); i++) printf(" %f", a.v[i]);
        printf("\n");
    }
    return sum;
}
Example Run (240x256x176 images)

C:>airwc
airwc v2.5 Usage: AirWc <target> <source> <result> opts(12rtdgsf)

C:>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii
converting short to float 0 0.000000
NIFTI Header on File sb2.nii
converting short to float 0 0.000000

Using device 0: GeForce 9600 GT

Initial correlation 0.734281
using cost function 1 (abs-difference)
using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost: 127946102

Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187

timer 0 1890 ms
timer 1    0 ms
timer 2 3849 ms
timer 3  448 ms
timer 4    0 ms
Total 6.187 secs

Final Transformation:
 0.944702 -0.184565  0.017164  40.637428
 0.301902  0.866726 -0.003767 -38.923237
-0.028792 -0.100618  0.990019  18.120852
 0.000000  0.000000  0.000000   1.000000

Final rots and shifts 6.096217 -0.156668 -19.187197 -0.012378 0.072203 0.122495
scales and shears 0.952886 0.912211 0.995497 0.150428 -0.101673 0.009023
Desktop 3D Registration
Registration with CUDA: 6 seconds
Registration with FLIRT 4.1: 8.5 minutes
Comments
• This is actually already very useful. Almost interactive (add visualisation)
• Further speedups possible:
  – Faster card
  – Smarter optimiser
  – Overlap IO and kernel execution
  – Tweak CUDA code
• Extend to non-linear local registration
Intel Larrabee?

Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.

Porting from CUDA to Larrabee should be easy.
Thank you