CUDA Image Registration 29 Oct 2008 Richard Ansorge Medical Image Registration A Quick Win Richard Ansorge



The problem

• CT, MRI, PET and Ultrasound produce 3D volume images

• Typically 256 x 256 x 256 = 16,777,216 image voxels.

• Combining modalities (inter-modality) gives extra information.

• Repeated imaging over time with the same modality, e.g. MRI (intra-modality), is equally important.

• The images have to be spatially registered.


Example – brain lesion

[Images: CT, MRI, PET]


PET-MR Fusion

The PET image shows metabolic activity.

This complements the MR structural information.


Registration Algorithm

Flowchart (inputs: Im A and Im B):

• Transform Im B to match Im A, giving Im B′

• Compute Cost Function (Im A against Im B′)

• Good fit? Yes: Done. No: Update transform parameters and transform Im B again

NB Cost function calculation dominates for 3D images and is inherently parallel
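The loop in the flowchart can be sketched on a toy problem. The C++ below is only an illustration under stated assumptions: the "images" are 1D arrays, the only transform parameter is an integer shift, and the parameter update is a simple step to the better neighbour, not the optimiser used in the talk. The names `cost_function` and `register_shift` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of absolute differences between image A and image B shifted by `shift`
// samples. Samples of B that fall outside the array are treated as zero.
double cost_function(const std::vector<float>& a,
                     const std::vector<float>& b, int shift)
{
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    double cost = 0.0;
    for (int i = 0; i < n; ++i) {
        const int j = i + shift;                          // "transform" of B
        const float bt = (j >= 0 && j < n) ? b[j] : 0.0f; // transformed sample
        cost += std::fabs(a[i] - bt);
    }
    return cost;
}

// Iterate: compute the cost, test for a good fit, otherwise update the
// transform parameter and try again.
int register_shift(const std::vector<float>& a,
                   const std::vector<float>& b, int max_iter = 100)
{
    int shift = 0;
    double best = cost_function(a, b, shift);
    for (int iter = 0; iter < max_iter; ++iter) {
        const double left  = cost_function(a, b, shift - 1);
        const double right = cost_function(a, b, shift + 1);
        if (left >= best && right >= best) break; // good fit: no neighbour is better
        if (left < right) { shift -= 1; best = left; }
        else              { shift += 1; best = right; }
    }
    return shift;
}
```

Every iteration spends almost all of its time inside `cost_function`, and each term of that sum is independent of the others, which is the part that maps naturally onto many threads.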


Transformations

General affine transform has 12 parameters:

\[
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
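A minimal C++ sketch of how the 12-parameter affine transform acts on a voxel position written in homogeneous form (x, y, z, 1). The row-major storage order and the names `Vec3`, `Affine` and `apply_affine` are assumptions made for this illustration.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Row-major 4x4 matrix; a[0..11] are the 12 free parameters and the
// last row is assumed to be (0, 0, 0, 1).
using Affine = std::array<float, 16>;

Vec3 apply_affine(const Affine& a, const Vec3& p)
{
    // An affine matrix always has (0, 0, 0, 1) as its last row.
    assert(a[12] == 0.0f && a[13] == 0.0f && a[14] == 0.0f && a[15] == 1.0f);
    return Vec3{
        a[0] * p.x + a[1] * p.y + a[2]  * p.z + a[3],   // x'
        a[4] * p.x + a[5] * p.y + a[6]  * p.z + a[7],   // y'
        a[8] * p.x + a[9] * p.y + a[10] * p.z + a[11]}; // z'
}
```

Only the first three rows need to be evaluated, since the fourth component of the result is always 1.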

Polynomial transformations can be useful for e.g. pin-cushion type distortions:

\[
\begin{aligned}
x' &= a_{11}x + a_{12}y + a_{13}z + a_{14} + b_1 x^2 + b_2 xy + b_3 y^2 + b_4 z^2 + b_5 xz + b_6 yz \\
y' &= \cdots \\
z' &= \cdots
\end{aligned}
\]

Local, non-linear transformations, e.g. using cubic B-splines, are increasingly popular but very computationally demanding.
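To give a feel for the cost of B-spline transformations, here is a small C++ sketch of the standard uniform cubic B-spline weights and a 1D displacement lookup. It is an illustration only: the function names and the 1D control-point layout are assumptions, not code from the talk. In 3D each voxel combines 4 x 4 x 4 = 64 control points.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Weights of the four uniform cubic B-spline basis functions at fractional
// position u in [0, 1] between two control points. They always sum to 1.
std::array<double, 4> bspline_weights(double u)
{
    assert(u >= 0.0 && u <= 1.0);
    const double u2 = u * u;
    const double u3 = u2 * u;
    return std::array<double, 4>{{
        (1.0 - u) * (1.0 - u) * (1.0 - u) / 6.0,
        (3.0 * u3 - 6.0 * u2 + 4.0) / 6.0,
        (-3.0 * u3 + 3.0 * u2 + 3.0 * u + 1.0) / 6.0,
        u3 / 6.0}};
}

// 1D displacement at position x (measured in control-point spacings) from a
// row of control-point displacements; four neighbours contribute.
double bspline_1d(const std::vector<double>& ctrl, double x)
{
    const int i = static_cast<int>(std::floor(x));
    assert(i >= 1 && i + 2 < static_cast<int>(ctrl.size()));
    const std::array<double, 4> w = bspline_weights(x - i);
    double d = 0.0;
    for (int k = 0; k < 4; ++k) d += w[k] * ctrl[i - 1 + k];
    return d;
}
```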


We tried this before

6 Parameter Rigid Registration - done 8 years ago

[Plot: time and speedup against Number of Processors (0 to 64). Left axis: Time/secs (0 to 1200); right axis: Speedup Factor (0 to 64). Series: SR2201, PC 333MHz, Speedup, perfect scaling]


Now - Desktop PC - Windows XP

Needs 400 W power supply


Free Software: CUDA & Visual C++ Express


Visual C++ SDK in action



Architecture


9600 GT Device Query

Current GTX 280 has 240 cores!


Matrix Multiply from SDK

NB using 4-byte floats
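For reference, a plain C++ version of the computation being benchmarked. This is a sketch, not the SDK sample: the function name is hypothetical, and it assumes that "mads" on the plots means multiply-adds. An N x N multiply performs N^3 of them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major n x n matrices of 4-byte floats.
// Returns the number of multiply-adds performed, which is n^3.
std::size_t matmul(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    c.assign(n * n, 0.0f);
    std::size_t mads = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j]; // one multiply-add
                ++mads;
            }
        }
    }
    return mads;
}
```

Dividing the returned count by the elapsed time in nanoseconds gives a rate in mads/ns.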


Matrix Multiply (from SDK)

[Plot: GPU v CPU for NxN Matrix Multiply. x-axis: N (0 to 6144); y-axis: GPU Speedup (0 to 400). Series: average speedup]


Matrix Multiply (from SDK)

[Plot: GPU v CPU for NxN Matrix Multiply. x-axis: N (0 to 6144); y-axis: GPU Speedup (0 to 800). Series: speedup, average speedup]


Matrix Multiply (from SDK)

[Plot: GPU v CPU for NxN Matrix Multiply. x-axis: N (0 to 6144); left axis: GPU Speedup (0 to 800); right axis: speed in mads/ns or mads/100 ns (0 to 40). Series: speedup, CPU mads/100 ns, GPU mads/ns]


Image Registration

CUDA Code


#include <cutil_math.h>

texture<float, 3, cudaReadModeElementType> tex1; // Target Image in texture
__constant__ float c_aff[16];                    // 4x4 Affine transform

// Function arguments are image dimensions and pointers to output buffer b
// and Source Image s. These buffers are in device memory
__global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
    int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
    if(ix >= nx || iy >= ny) return; // guard: the grid can overhang the image if nx or ny is not a multiple of the block size
    float x = (float)ix;
    float y = (float)iy;
    float z = 0.0f; // start with slice zero
    float4 v = make_float4(x,y,z,1.0f);
    float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
    float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
    float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
    float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
    float tx = dot(r0,v); // Matrix Multiply using dot products
    float ty = dot(r1,v);
    float tz = dot(r2,v);
    float source = 0.0f;
    float target = 0.0f;
    float cost = 0.0f;
    uint is = iy*nx+ix;
    uint istep = nx*ny;
    for(int iz=0;iz<nz;iz++) { // process all z's in same thread here
        source = s[is];
        target = tex3D(tex1, tx, ty, tz);
        is += istep;
        v.z += 1.0f;
        tx = dot(r0,v);
        ty = dot(r1,v);
        tz = dot(r2,v);
        cost += fabs(source-target); // other costfuns here as required
    }
    b[iy*nx+ix]=cost; // store thread sum for host
}
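As a cross-check for the kernel, the same cost function can be written for the CPU. The C++ below is a hedged sketch, not the author's code: `Volume`, `mix`, `trilinear` and `costfun` are hypothetical names, out-of-range samples are clamped to the nearest voxel (an assumed edge policy), and voxel index i is taken to sit at coordinate i, which ignores the half-texel offset and reduced-precision interpolation weights of real texture hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for a 3D image; x is the fastest-varying index,
// matching the kernel's index iz*nx*ny + iy*nx + ix.
struct Volume {
    int nx, ny, nz;
    std::vector<float> v;

    // Voxel lookup with indices clamped to the volume (assumed edge policy).
    float at(int x, int y, int z) const {
        x = std::min(std::max(x, 0), nx - 1);
        y = std::min(std::max(y, 0), ny - 1);
        z = std::min(std::max(z, 0), nz - 1);
        return v[(static_cast<std::size_t>(z) * ny + y) * nx + x];
    }
};

inline float mix(float a, float b, float t) { return a + t * (b - a); }

// Software stand-in for the trilinear interpolation done by tex3D.
float trilinear(const Volume& vol, float x, float y, float z)
{
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int z0 = static_cast<int>(std::floor(z));
    const float dx = x - x0, dy = y - y0, dz = z - z0;
    // interpolate along x on four edges, then along y, then along z
    const float c00 = mix(vol.at(x0, y0,     z0),     vol.at(x0 + 1, y0,     z0),     dx);
    const float c10 = mix(vol.at(x0, y0 + 1, z0),     vol.at(x0 + 1, y0 + 1, z0),     dx);
    const float c01 = mix(vol.at(x0, y0,     z0 + 1), vol.at(x0 + 1, y0,     z0 + 1), dx);
    const float c11 = mix(vol.at(x0, y0 + 1, z0 + 1), vol.at(x0 + 1, y0 + 1, z0 + 1), dx);
    return mix(mix(c00, c10, dy), mix(c01, c11, dy), dz);
}

// Sum of absolute differences between `ref` (the kernel's buffer s) and
// `moving` (the kernel's texture) sampled at the affinely transformed position.
double costfun(const Volume& ref, const Volume& moving, const float aff[16])
{
    assert(ref.v.size() == static_cast<std::size_t>(ref.nx) * ref.ny * ref.nz);
    double cost = 0.0;
    std::size_t is = 0;
    for (int iz = 0; iz < ref.nz; ++iz)
        for (int iy = 0; iy < ref.ny; ++iy)
            for (int ix = 0; ix < ref.nx; ++ix) {
                const float tx = aff[0] * ix + aff[1] * iy + aff[2]  * iz + aff[3];
                const float ty = aff[4] * ix + aff[5] * iy + aff[6]  * iz + aff[7];
                const float tz = aff[8] * ix + aff[9] * iy + aff[10] * iz + aff[11];
                cost += std::fabs(ref.v[is++] - trilinear(moving, tx, ty, tz));
            }
    return cost;
}
```

The triple loop here is what the kernel spreads over threads: one thread per (ix, iy) pair, each running the iz loop itself.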


texture<float, 3, cudaReadModeElementType> tex1;

__constant__ float c_aff[16];

tex1: moving image, stored as 3D texture

c_aff: affine transformation matrix, stored as constants


// device function declaration
__global__ void d_costfun(int nx,int ny,int nz,float *b,float *s)

nx, ny & nz: image dimensions (assumed the same for both images)

b: output array for partial sums

s: reference image (mislabelled in code)


int ix = blockIdx.x*blockDim.x + threadIdx.x; // Thread ID matches
int iy = blockIdx.y*blockDim.y + threadIdx.y; // Source Image x-y
float x = (float)ix;
float y = (float)iy;
float z = 0.0f; // start with slice zero

Which thread am I? Similar to MPI, except that there is one thread for each x-y pixel: 240x256 = 61440 threads (cf. ~128 nodes for MPI).
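The thread count can be checked with a few lines of C++. This is a sketch assuming square 16 x 16 blocks and a grid size rounded up to cover the image; `blocks_for` and `threads_launched` are hypothetical helper names.

```cpp
#include <cassert>

// Number of blocks needed to cover n items with blocks of size `block`
// (integer division rounded up).
constexpr int blocks_for(int n, int block) { return (n + block - 1) / block; }

// Total threads launched for a w x h image with square blocks of side `block`.
constexpr int threads_launched(int w, int h, int block)
{
    return blocks_for(w, block) * blocks_for(h, block) * block * block;
}

// 240 x 256 image, 16 x 16 blocks: 15 x 16 blocks, one thread per pixel.
static_assert(threads_launched(240, 256, 16) == 240 * 256, "exact cover");
```

When a dimension is not a multiple of the block size the grid overhangs the image. A 250 x 256 image, for example, launches 65536 threads for 64000 pixels, so a kernel launched this way needs an index check before it reads or writes.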


float4 v = make_float4(x,y,z,1.0f);
float4 r0 = make_float4(c_aff[ 0],c_aff[ 1],c_aff[ 2],c_aff[ 3]);
float4 r1 = make_float4(c_aff[ 4],c_aff[ 5],c_aff[ 6],c_aff[ 7]);
float4 r2 = make_float4(c_aff[ 8],c_aff[ 9],c_aff[10],c_aff[11]);
float4 r3 = make_float4(c_aff[12],c_aff[13],c_aff[14],c_aff[15]); // 0,0,0,1?
float tx = dot(r0,v); // Matrix Multiply using dot products
float ty = dot(r1,v);
float tz = dot(r2,v);
float source = 0.0f;
float target = 0.0f;
float cost = 0.0f; // accumulates cost function contributions
v.z=0.0f; // z of first slice is zero (redundant as done above)
uint is = iy*nx+ix; // this is index of my voxel in first z-slice
uint istep = nx*ny; // stride to index same voxel in subsequent slices

Initialisations and first matrix multiply.

"v" is the 4-vector holding the current voxel x,y,z address.

"tx,ty,tz" hold the corresponding transformed position.


for(int iz=0;iz<nz;iz++) { // process all z's in same thread here
    source = s[is];
    target = tex3D(tex1, tx, ty, tz); // NB very FAST trilinear interpolation!!
    is += istep;
    v.z += 1.0f; // step to next z slice
    tx = dot(r0,v);
    ty = dot(r1,v);
    tz = dot(r2,v);
    cost += fabs(source-target); // other costfuns here as required
}
b[iy*nx+ix]=cost; // store thread sum for host

Loop sums contributions for all z values at fixed x,y position. Each thread updates a different element of 2D results array b.

[Figure: X, Y and Z axes of the image volume]


Host Code Initialization Fragment

...
blockSize.x = blockSize.y = 16; // multiples of 16 a VERY good idea
gridSize.x = (w2+15) / blockSize.x;
gridSize.y = (h2+15) / blockSize.y;

// allocate working buffers, image is W2 x H2 x D2
cudaMalloc((void**)&dbuff,w2*h2*sizeof(float)); // passed as "b" to kernel
bufflen = w2*h2;
Array1D<float> shbuff = Array1D<float>(bufflen);
shbuff.Zero();
hbuff = shbuff.v;

cudaMalloc((void**)&dnewbuff,w2*h2*d2*sizeof(float)); // passed as "s" to kernel
cudaMemcpy(dnewbuff,vol2,w2*h2*d2*sizeof(float),cudaMemcpyHostToDevice);

e = make_float3((float)w2/2.0f,(float)h2/2.0f,(float)d2/2.0f); // fixed rotation origin
o = make_float3(0.0f); // translations
r = make_float3(0.0f); // rotations
s = make_float3(1.0f,1.0f,1.0f); // scale factors
t = make_float3(0.0f); // tans of shears
...


Calling the Kernel

double nr_costfun(Array1D<double> &a)
{
    static Array2D<float> affine = Array2D<float>(4,4); // a holds current transformation
    double sum = 0.0;
    make_affine_from_a(nr_fit,affine,a); // convert to 4x4 matrix of floats

    cudaMemcpyToSymbol(c_aff,affine.v[0],4*4*sizeof(float)); // load constant mem
    d_costfun<<<gridSize, blockSize>>>(w2,h2,d2,dbuff,dnewbuff); // run kernel
    CUT_CHECK_ERROR("kernel failed"); // OK?
    cudaThreadSynchronize(); // make sure all done

    // copy partial sums from device to host
    cudaMemcpy(hbuff,dbuff,bufflen*sizeof(float),cudaMemcpyDeviceToHost);

    for(int iy=0;iy<h2;iy++) for(int ix=0;ix<w2;ix++) sum += hbuff[iy*w2+ix]; // final sum
    calls++;
    if(verbose>1){
        printf("call %d costfun %12.0f, a:",calls,sum);
        for(int i=0;i<a.sizex();i++) printf(" %f",a.v[i]);
        printf("\n");
    }
    return sum;
}


Example Run (240x256x176 images)

C: >airwc
airwc v2.5 Usage: AirWc <target> <source> <result> opts(12rtdgsf)

C:>airwc sb1 sb2 junk 1f
NIFTI Header on File sb1.nii
converting short to float 0 0.000000
NIFTI Header on File sb2.nii
converting short to float 0 0.000000

Using device 0: GeForce 9600 GT

Initial correlation 0.734281
using cost function 1 (abs-difference)
using cost function 1 (abs-difference)
Amoeba time: 4297, calls 802, cost:127946102

Cuda Total time 4297, Total calls 802
File dofmat.mat written
Nifti file junk.nii written, bswop=0
Full Time 6187

timer 0 1890 ms
timer 1 0 ms
timer 2 3849 ms
timer 3 448 ms
timer 4 0 ms
Total 6.187 secs
Final Transformation:
 0.944702 -0.184565  0.017164  40.637428
 0.301902  0.866726 -0.003767 -38.923237
-0.028792 -0.100618  0.990019  18.120852
 0.000000  0.000000  0.000000   1.000000
Final rots and shifts 6.096217 -0.156668 -19.187197 -0.012378 0.072203 0.122495
scales and shears 0.952886 0.912211 0.995497 0.150428 -0.101673 0.009023
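A little arithmetic on the run log above, as a C++ sketch. It assumes the "Amoeba time" of 4297 is in milliseconds, which is consistent with "Full Time 6187" against "Total 6.187 secs"; `RunStats` and `summarise` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct RunStats {
    double ms_per_call;       // average time for one cost function call
    double voxels_per_call;   // voxels visited by one call
    double voxels_per_second; // resulting throughput
};

// Derive per-call time and voxel throughput from the totals in the log.
RunStats summarise(double total_ms, int calls, int nx, int ny, int nz)
{
    assert(calls > 0 && total_ms > 0.0);
    RunStats s;
    s.ms_per_call = total_ms / calls;
    s.voxels_per_call = static_cast<double>(nx) * ny * nz;
    s.voxels_per_second = s.voxels_per_call / (s.ms_per_call * 1e-3);
    return s;
}
```

With the figures from the log, `summarise(4297.0, 802, 240, 256, 176)` gives about 5.4 ms per call over 10,813,440 voxels, or roughly 2 x 10^9 voxels per second. The five timers also add up to the full time: 1890 + 0 + 3849 + 448 + 0 = 6187 ms.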


Desktop 3D Registration

Registration with CUDA: 6 Seconds

Registration with FLIRT 4.1: 8.5 Minutes


Comments

• This is actually already very useful. Almost interactive (add visualisation)

• Further speedups possible

– Faster card

– Smarter optimiser

– Overlap IO and Kernel execution

– Tweak CUDA code

• Extend to non-linear local registration


Intel Larrabee?

Figure 1: Schematic of the Larrabee many-core architecture: The number of CPU cores and the number and type of co-processors and I/O blocks are implementation-dependent, as are the positions of the CPU and non-CPU blocks on the chip.

Porting from CUDA to Larrabee should be easy


Thank you