
PROJECT REPORT

APPLIED PARALLEL PROGRAMMING

Advanced MRI Reconstruction

Final Project Report

Abdallah Abu-Ghazaleh

Gokul Subramani

Gurudutt Ramaprasad

Karthik Manivannan

Sridhar Panneer selvam


INTRODUCTION

Modern GPUs are increasingly becoming an attractive platform for speeding up applications ranging from black hole simulations to DNA sequencing, and the number of applications accelerated by CUDA is expanding every day. To learn and experience the effectiveness of CUDA first-hand, we took up the challenge of speeding up the computations involved in advanced Magnetic Resonance Imaging (MRI) reconstruction.

In this paper, we present an approach to accelerating an MRI image reconstruction algorithm using CUDA C on a Graphics Processing Unit (GPU), the GTX 480. The raw data captured by the coils as k-space data must be processed before a doctor can actually interpret the results of the MRI scan, and this processing takes a few hours for the best algorithm implemented on a CPU. We observed that the algorithm contains portions that exhibit high parallelism, and these data-parallel computations can run much faster on the GPU. The bottlenecks of the MRI reconstruction algorithm were identified and those functions were implemented in CUDA C. This paper focuses on the approach and the techniques used to convert the MATLAB version of the algorithm into a highly parallel CUDA C version.

MOTIVATION

The motivation for this project stems from the fact that an MRI scan can be a long and uncomfortable experience for patients, requiring them to lie in the machine for close to an hour without moving. This project, under the guidance of Prof. Dr. John Sartori and Prof. Mehmet Akcakaya, aims at reducing the time required for the scan and the computations involved by utilizing the GPU. Dr. Mehmet Akcakaya is currently exploring an algorithm called LOST (LOw dimensional-structure Self learning and Thresholding), which adaptively finds a sparse representation of the given image using the various features in the image rather than a pre-determined, fixed transform domain.

The algorithm consists of two phases: one phase de-noises the image for accuracy, and the other performs the Fourier transform for reconstruction.

We proposed that the scope for parallelism is huge during the computation of the FFT and IFFT, which operate on 4D input data as large as 256*232*88*32. We planned to use the built-in CUDA library specialised for the FFT, cuFFT, to optimise the execution of the transform in parallel.


DESIGN OVERVIEW:

The Discrete Fourier Transform maps a complex-valued vector x of length N into its frequency-domain representation X, given by

X[k] = sum_{n=0}^{N-1} x[n] * e^(-2*pi*i*k*n/N),  k = 0, ..., N-1

Figure 1. MRI Reconstruction overview

The cuFFT library uses an algorithm with O(N log N) complexity for an input of size N. Of the various forms of FFT, we used the C2C (complex input to complex output) FFT. A GTX 480 processor was used for running the code. The device capabilities are:

Compute capability – 2.0
Microarchitecture – Fermi
Maximum number of threads/block – 1024
Maximum amount of shared memory – 48 kB
Constant memory size – 64 kB
Global memory – 1.46 GB DRAM
Pitched memory – 2.0 GB

Cufft vs fftw: A study from the University of Waterloo found that cuFFT is better for larger FFT sizes, starting to outperform FFTW around data sizes of 8192 elements. It is therefore safe to assume that cuFFT works better than FFTW for inputs larger than roughly 10,000 elements.

PROFILE SUMMARY of MATLAB Code:

Figure 2. Profile Summary

It was found that, of all the functions in the MATLAB code provided to us, just three, recons_cs, data_consistency_3d, and RecPF_denoise, consumed 72% (548 s out of 752 s) of the runtime. It therefore made sense to convert only those functions into CUDA C.

FLOW CHART:

Figure 3. Flow chart

IMPLEMENTATION: The basics

We read our inputs from an ASCII file and store the data in a cuComplex struct, a basic struct with two member variables, x and y, representing the real and imaginary parts of the number respectively. Any computation on cuComplex numbers must be carried out using the functions in the cuComplex.h library (the operators are not overloaded to perform addition, multiplication, etc.).

We also use the cuFFT library, which is part of the CUDA toolkit. cuFFT employs cuComplex as its input data structure for any kind of FFT computation, and provides methods for both the forward and the inverse Fourier transform. We had to implement the FFT shifts and inverse FFT shifts ourselves.

Step 1: Input the image file (256*232*88*32) into the GPU's global memory.
Step 2: Find the centre dimensions to determine the critical image area and filter the image using a Tukey window filter.
Step 3: Perform the inverse fast Fourier transform for all coils and sum them down to a 3D volume.
Step 4: De-noise the image using the traditional RecPF denoise method.
Step 5: Repeat the de-noise process for 25 iterations.


The original k-space 3D data obtained from the input is squeezed along the X dimension to form a 2D image, where the most critical data lie at the centre of the image. The critical data must be selected from the input image through Y-start, Y-end, Z-start and Z-end. Initially, the start and end points in both dimensions point to the centre-most pixel of the image, and we then determine a rectangle around this point which contains all the valid data.

Figure 4. Centre dimension

This rectangle now contains the key image data that needs to be filtered and enhanced the most, compared to the other pixels in the image. To preserve the centre data and reduce the noise of the surrounding elements, we filter the image using a Tukey window filter, then sum the filtered image along all 32 coils to get a 3D input image of size 256x232x88.
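The coil-combination sum above reduces the 4D per-coil data to a single 3D volume. A minimal CPU sketch of that reduction (the cplx struct mirrors cuComplex; the coil-major layout and the names are our assumptions):

```c
#include <stddef.h>

typedef struct { float x, y; } cplx;   /* mirrors cuComplex: real, imag */

/* Sum ncoils volumes of nvox complex voxels each into one 3D volume.
 * The input is assumed coil-major: in[c * nvox + v]. */
void sum_coils(const cplx *in, cplx *out, size_t nvox, int ncoils)
{
    for (size_t v = 0; v < nvox; v++) {
        cplx acc = {0.0f, 0.0f};
        for (int c = 0; c < ncoils; c++) {
            acc.x += in[c * nvox + v].x;   /* accumulate real parts */
            acc.y += in[c * nvox + v].y;   /* accumulate imag parts */
        }
        out[v] = acc;
    }
}
```

In the real pipeline nvox is 256*232*88 and ncoils is 32; this loop is embarrassingly parallel over voxels, which is why it maps well to the GPU.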

The filtered 3D image is sent to the de-noise function RecPF denoise and an image estimate is found. The estimated 3D image is then compared with the original 4D image for all 32 coils, and the essential, i.e. critical, part of the image is preserved while the noise is cancelled out. We repeat this process 25 times to completely eliminate the noise from the image, as shown in Figure 1.

As GPU engineers we are interested in the parallel part of the algorithm, so we take the original image and the filtered image as our input. Figure 2 shows that the functions RecPF denoise and dataconsistency3d take much longer than the other functions, so these are the two we are interested in implementing on the GPU.

IMPLEMENTATION: RecPF Denoise

The second part of the code whose parallelism could be exploited was the RecPF_denoise function. About 175 seconds (roughly 3 minutes) is spent processing data in this function. It gets called 24 times (one less than the 3D data consistency part), so each call takes about 7.3 seconds. Our goal is to beat that number.

The RecPF denoise MATLAB code contains many computations and library calls to both the image processing and the signal processing toolkits. There are numerous FFT (Fourier transform) computations and numerous finite difference computations, along with manipulation of the input data. The input data is the single image produced after each cycle of the main reconstruction algorithm (mainly the 3D data consistency part).

RecPF_denoise first initializes constants for its finite difference method in constant memory and compiles them to the device. It also initializes the optimization parameters in constant memory; values in constant memory are retained across kernel calls. RecPF_denoise then starts by calculating the initial parameters for the main loop, a numerator and a denominator, both of which require an FFT.

Following those initializations, we start our main loop, which is divided into a few parts. We would have liked to create a single kernel for the entire processing of the data, but because we need to call cuFFT throughout, we must divide the calculations into parts and save the current state in global memory so we can resume from the cuFFT outputs. The loop runs three times and solves the three big parts of the MATLAB code: the W sub-problem, the U sub-problem, and the final Bregman update of values for the next run. We wrote a brute-force version of the code first and then started optimizing.

For some of the computations, we reordered the variables in order to minimize the number of variables used in the brute-force approach (and in the MATLAB code). By removing the extra variables we reduced memory consumption and increased performance. Additionally, parts of the MATLAB code were not very efficient: for example, some data was calculated but never used. We optimized the code to remove those inefficiencies as well.

For the Fourier transform code, the implementation was straightforward. For the finite difference methods, we used code published on the NVIDIA blog [5] and modified it to handle our input size and cuComplex data. The code uses shared memory, and we had to tune the shared memory size because, with our input size, the compiler initially reported that we were allocating more shared memory than the device provides.

Combining the code proved a little challenging: we started running into problems with the finite differences when we tried to put them all in one kernel call, so we abandoned that idea, especially since the RecPF denoise part had problems with verification.

VERIFICATION: RecPF Denoise

For the RecPF denoise part, we started analysing the data against expected results (produced by the MATLAB code) and observed that our output was entirely NaN (a problem we had also encountered when initially running the MATLAB code with wrong configuration parameters). This led to an investigation of which parts were the problem. Starting from the beginning, we noticed that the results for the numerator and denominator do not match those from MATLAB.

At the moment, the method produces NaN + jNaN for all of its outputs. A sample of the first two expected elements: -2.483373e-10+j-1.389864e-09 and 1.088936e-10+j6.383455e-10. Further work should go into identifying what causes this; we ran out of time to continue triaging it.
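A helper like the following (ours, purely illustrative) is what the first triage step boils down to: scan the output buffer and count elements with a NaN component, which quickly localises whether the corruption starts at the numerator/denominator stage or later:

```c
#include <math.h>
#include <stddef.h>

typedef struct { float x, y; } cplx;   /* mirrors cuComplex: real, imag */

/* Count elements whose real or imaginary part is NaN. A nonzero count
 * after a kernel pinpoints the first stage that corrupts the data. */
size_t count_nans(const cplx *a, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (isnan(a[i].x) || isnan(a[i].y))
            bad++;
    return bad;
}
```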

PERFORMANCE

Initially, our CUDA implementation of the RecPF_denoise function achieved a runtime of 40 seconds over 10 iterations. We noticed later that our average calculations were incorrect (we were not dividing by 10), which works out to about 4 seconds per loop iteration. This is smaller than 7 seconds, so we were getting some performance improvement (though not much yet). After further analysis, we discovered that our nvcc invocation was building in debug mode; after creating a new Makefile and re-running the kernel, we achieved a huge performance increase and now get under half a second per run. That is about a 14x improvement (not 100x yet, but a great improvement).

We tried changing the block size of the finite difference part, but did not observe much change in performance (we were already using a long stencil in shared memory, which was already optimized).

The memory cost of RecPF_denoise is huge: besides the input and output data, we allocate 10 other matrices to store our data, for a total cost of 256*232*88*8*(10+2) = 501,743,616 bytes, or about 0.50 GB. It is hard to tell whether we are bound by memory or by computation in our case.
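The footprint arithmetic above can be checked directly (a throwaway helper of ours; the count of 10 scratch matrices plus input and output is taken from the text):

```c
#include <stddef.h>

/* Device-memory footprint of RecPF denoise: 10 scratch matrices plus
 * input and output, each 256x232x88 cuComplex (8-byte) elements. */
size_t recpf_bytes(void)
{
    size_t voxels = (size_t)256 * 232 * 88;   /* 5,226,496 elements */
    size_t elem   = 8;                        /* sizeof(cuComplex)  */
    size_t mats   = 10 + 2;                   /* scratch + in + out */
    return voxels * elem * mats;              /* total bytes        */
}
```

At roughly 0.50 GB this fits in the GTX 480's 1.46 GB of global memory, but leaves little headroom for double-buffering.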

IMPLEMENTATION: dataconsistency3d

The source MATLAB code for the dataconsistency3d function is shown below:

sig_check = fftshift(fftn(ifftshift(img, 1)));
sig_check(picks) = sig(picks);
z = fftshift(ifftn(ifftshift(sig_check)), 1);

The main goal of this function is to retain the critical part of the image. During the RecPF de-noise process all the image data is de-noised. picks holds the indices of all the critical parts of the data, and sig is the original 4D data: perform the FFT, replace the picked entries with the original data, perform the IFFT, and send the result back to the RecPF denoise process. This continues for 25 iterations. Since many 3D FFTs and IFFTs are performed, a CUDA implementation speeds up the process.
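The sig_check(picks) = sig(picks) line is the heart of the function: the originally acquired k-space samples overwrite the estimate at the picked indices, so de-noising can never alter measured data. A CPU sketch of that replacement step (the flat index layout and names are our assumptions):

```c
typedef struct { float x, y; } cplx;   /* mirrors cuComplex: real, imag */

/* Data-consistency step: copy the original k-space samples back over
 * the current estimate at each picked (measured) flat index. */
void apply_data_consistency(cplx *sig_check, const cplx *sig,
                            const int *picks, int npicks)
{
    for (int i = 0; i < npicks; i++)
        sig_check[picks[i]] = sig[picks[i]];
}
```

Each picked index is independent of the others, so on the GPU this is a trivially parallel one-thread-per-pick kernel.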

The first step is converting MATLAB functions such as fftshift and ifftshift into C code; kernel code for fftshift and ifftshift was implemented. The following table compares the time taken by the fftshift and ifftshift functions on the CPU and on the GPU. The input size is the 3D matrix size, 256x232x88.

Table 1. FFT and IFFT shift function – CPU vs GPU

Function           | CPU time  | GPU time | Performance gain
fftshift function  | 12.5 ms   | 0.51 ms  | 24.5x
ifftshift function | 12.945 ms | 0.753 ms | 17.19x
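For even dimensions, fftshift is a diagonal quadrant swap that moves the zero-frequency element to the centre. A CPU reference of the 2D index mapping (illustrative, real-valued for brevity) that the Appendix A kernel parallelises:

```c
/* CPU reference 2D fftshift for a row-major nx-by-ny array: each
 * element moves by half the grid in both dimensions (modulo the size),
 * swapping quadrants diagonally. Exact for even nx and ny. */
void fftshift_2d(const float *in, float *out, int nx, int ny)
{
    int hx = nx / 2, hy = ny / 2;
    for (int yy = 0; yy < ny; yy++)
        for (int xx = 0; xx < nx; xx++)
            out[((yy + hy) % ny) * nx + (xx + hx) % nx] = in[yy * nx + xx];
}
```

Since every output element depends on exactly one input element, the GPU version assigns one thread per element, which is why the kernel scales so well.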

The FFT and IFFT are performed using cuFFT functions. The first step is to create a plan using the cufftPlan function; a plan, once created, can be reused for subsequent calls. We give the plan the x, y, z dimensions and its computation type, e.g. complex-to-complex. The FFT and IFFT calls are then made by executing the plan with the cufftExec function. The table below shows the time taken by the dataconsistency3d function in MATLAB and on the GPU.

CUFFT implementation for the FFT:

cufftHandle plan_fft;
cufftPlan2d(&plan_fft, y, x, CUFFT_C2C);
cufftExecC2C(plan_fft, img_d, out_d, CUFFT_FORWARD);
cudaDeviceSynchronize();

CUFFT implementation for the IFFT:

cufftHandle plan_ifft;
cufftPlan2d(&plan_ifft, y, x, CUFFT_C2C);
cufftExecC2C(plan_ifft, img_d, out_d, CUFFT_INVERSE);
cudaDeviceSynchronize();
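One caveat worth noting for anyone reproducing this: cuFFT's CUFFT_INVERSE transform is unnormalized, so a forward-then-inverse round trip scales every element by the element count. The rescaling, sketched on the host (a hypothetical helper; on the GPU this would be a small element-wise kernel or a cublasCsscal call):

```c
typedef struct { float x, y; } cplx;   /* mirrors cuComplex: real, imag */

/* cuFFT's inverse transform is unnormalized: CUFFT_FORWARD followed by
 * CUFFT_INVERSE multiplies every element by n. Divide by the element
 * count to recover the original scale. */
void normalize_ifft(cplx *a, long n)
{
    for (long i = 0; i < n; i++) {
        a[i].x /= (float)n;
        a[i].y /= (float)n;
    }
}
```

Forgetting this scale factor is a classic source of off-by-N mismatches when validating against MATLAB's ifftn, which normalizes internally.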


Table 2. Time taken – Matlab Vs GPU for various Input Sizes

No. of coils | Input size (MB) | MATLAB t1 (ms) | GPU t2 (ms) | Gain (t1/t2)
1            | 39.875          | 975            | 57          | 17.11
2            | 79.75           | 1437           | 103.927     | 13.83
4            | 159.5           | 2083           | 200         | 10.42
8            | 319             | 4106           | 385.53      | 10.65
16           | 638             | 7761           | 777.987     | 9.98
32           | 1276            | 15356          | 1455        | 10.55

Figure 5. Time taken – Matlab Vs GPU for various Input Sizes



Figure 6. Performance gain for various input sizes

In the MATLAB implementation, the time taken to compute the data consistency 3D function increases drastically with input size, reaching a maximum of about 16 s for all 32 coils, whereas in the CUDA implementation the maximum time is only 1.5 s for all 32 coils.

The FFT kernel is launched once for every coil. As the number of coils increases, the overhead of memory transfers between the host and the device also increases. Currently, only one stream is used for performing the FFT/IFFT. Using multiple streams to pipeline memory copies and kernel execution could increase performance further. We experimented with multiple streams and software pipelining to improve the throughput, but that caused output mismatches, and further work is needed to resolve this issue. The full implementation is attached in the Appendix for reference.

VERIFICATION: dataconsistency3d



Validating Reconstructed Image Using Peak Signal-to-Noise Ratio:

The Mean Square Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR) are the two error metrics used to compare image quality. The MSE represents the cumulative squared error between the reconstructed and the original image, whereas the PSNR represents a measure of the peak error: PSNR = 10 * log10(max(I0)^2 / MSE). I0 is a known, "perfect" answer, typically produced by performing the function in MATLAB. The following table shows the MSE and PSNR for different coil inputs.

Table 3. MSE and PSNR for various coils

Coils | Mean Square Error | Maximum I0 | PSNR (dB)

Coil1 2.76485E-05 0.8731 44.40455724

Coil2 0.000262461 0.8297 34.18777834

Coil3 2.52628E-05 0.8984 45.0445775

Coil4 2.45329E-05 0.8798 44.99019378

Coil5 4.57182E-07 0.9082 62.56273853

Coil6 6.1467E-06 0.9304 51.48697426

Coil7 1.42437E-05 0.9319 47.85117212

Coil8 1.46217E-06 0.9719 58.10246753

Coil9 9.14289E-05 0.97 40.12459774

Coil10 0.00010088 0.9363 39.39023257

Coil11 6.15868E-06 0.9399 51.56675881

Coil12 0.000287483 0.9498 34.96651801

Coil13 0.000110432 0.978 39.37581573

Coil14 4.6884E-05 0.9709 43.03324413

Coil15 6.61326E-05 0.9484 41.33567204

Coil16 0.000187387 0.8771 36.13359831


Coil17 8.53287E-06 0.9545 50.28456843

Coil18 9.77867E-06 0.8778 48.9651119

Coil19 1.49731E-06 0.9177 57.50088891

Coil20 1.22615E-05 0.8416 47.6166772

Coil21 3.36993E-05 0.9124 43.92750306

Coil22 2.44395E-05 0.8917 45.12345043

Coil23 7.40974E-05 0.8703 40.09534734

Coil24 4.89955E-05 0.8472 41.65815335

Coil25 0.000186808 0.8877 36.25137087

Coil26 4.91792E-05 0.9376 42.52253555

Coil27 1.26168E-05 0.9754 48.77417303

Coil28 5.07543E-05 0.9687 42.66905996

Coil29 7.33233E-05 0.9274 40.69292106

Coil30 2.85137E-06 0.9467 54.97370418

Coil31 1.08874E-05 0.9251 48.95454347

Coil32 9.10939E-06 0.9596 50.04691095

Figure 7. PSNR of all Coils

As can be seen from the graph above, an average PSNR of about 45 dB is obtained across the outputs of all the coils.



CONCLUSION

The bottleneck functions in the MATLAB code for MRI reconstruction were identified. Of the two bottleneck functions, the RecPF denoise function did not exhibit enough parallelism and works better when implemented sequentially. The parallel implementation of the other function, dataconsistency3d, however, showed considerable speedup (10.55x) over its sequential counterpart. Further improvements can be made by performing the CUDA FFT/IFFT of multiple coils simultaneously over parallel streams. The FFT and IFFT shifts were implemented natively on the CPU as well as on the GPU in CUDA; the parallel FFT shift was found to be 24x faster, and the IFFT shift 17x faster. Overall, we experimented with multiple ways to improve the existing MATLAB code and were able to successfully accelerate the computation.

REFERENCES

1. https://developer.nvidia.com/cufft
2. GPU Computing Gems (Emerald Edition), Wen-mei W. Hwu, ISBN 978-0-12-384988-5.
3. http://developer.nvidia.com/object/matlab_cuda.html
4. http://mri-q.com/index.html
5. http://devblogs.nvidia.com/parallelforall/finite-difference-methods-cuda-cc-part-1/
6. University of Waterloo (2007). http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/

APPENDIX A. CUDA kernel code for FFT shift

__global__ void cufftShift_2D_kernel(cuComplex* input, cuComplex* output,
                                     int Nx, int Ny)
{
    // 2D slice and 1D line sizes
    int sLine  = Ny;
    int sSlice = Nx * Ny;

    // Transformation equations
    int sEq1 = (sSlice + sLine) / 2;
    int sEq2 = (sSlice - sLine) / 2;

    // 2D thread index
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    // Thread index converted into a flat 1D index
    int index = (yIndex * Nx) + xIndex;

    if (xIndex < Nx / 2)
    {
        if (yIndex < Ny / 2)
            output[index] = input[index + sEq1];   // first quadrant
        else
            output[index] = input[index - sEq2];   // third quadrant
    }
    else
    {
        if (yIndex < Ny / 2)
            output[index] = input[index + sEq2];   // second quadrant
        else
            output[index] = input[index - sEq1];   // fourth quadrant
    }
}

APPENDIX B. CUDA kernel code for IFFT shift

__global__ void cuifftShift_2D_kernel(cuComplex* input, cuComplex* output,
                                      int Nx, int Ny)
{
    // 2D thread index
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    // Thread index converted into a flat 1D index
    int index = (yIndex * Nx) + xIndex;

    if (xIndex < Nx / 2)
        output[index] = input[index + Nx / 2];   // first half
    else
        output[index] = input[index - Nx / 2];   // second half
}

APPENDIX C. CUDA code for FFT/IFFT implementation using multiple streams

// Per-stream device buffers: img_d[s] holds the staged coil, in_d[s] the
// shifted input, out_d[s] the transform output, out_s[s] the shifted output.
// Streams and plans are created once, outside the main loop.
cudaStream_t streams[4];
cufftHandle plans[4];
for (int s = 0; s < 4; s++) {
    cudaStreamCreate(&streams[s]);
    cufftPlan2d(&plans[s], y, x, CUFFT_C2C);
    cufftSetStream(plans[s], streams[s]);
}

dim3 dimblock(16, 16, 1);
dim3 dimgrid(256 / 16, 232 * 88 / 16, 1);

// Process the 32 coils in groups of four, one coil per stream.
for (int i = 0; i < 8; i++) {
    int base = 4 * i;

    // Stage each coil on the device and apply the inverse FFT shift.
    for (int s = 0; s < 4; s++) {
        cudaMemcpyAsync(img_d[s], img_h + (base + s) * inputSize,
                        inputSize * sizeof(cuComplex),
                        cudaMemcpyHostToDevice, streams[s]);
        cuifftShift_2D_kernel<<<dimgrid, dimblock, 0, streams[s]>>>(in_d[s], img_d[s], x, y);
    }

    // Forward FFT on each stream.
    for (int s = 0; s < 4; s++)
        cufftExecC2C(plans[s], in_d[s], out_d[s], CUFFT_FORWARD);

    // FFT shift and copy the spectra back to the host.
    for (int s = 0; s < 4; s++) {
        cufftShift_2D_kernel<<<dimgrid, dimblock, 0, streams[s]>>>(out_s[s], out_d[s], x, y);
        cudaMemcpyAsync(out_h + (base + s) * inputSize, out_s[s],
                        inputSize * sizeof(cuComplex),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < 4; s++)
        cudaStreamSynchronize(streams[s]);

    // Data consistency: the originally acquired samples overwrite the
    // estimate at the picked indices.
    for (int l = 0; l < picks_size * 4; l++) {
        out_h[picks_h[l]].x = sig_h[picks_h[l]].x;
        out_h[picks_h[l]].y = sig_h[picks_h[l]].y;
    }

    // Copy back, FFT shift, inverse FFT, inverse shift, and copy out.
    for (int s = 0; s < 4; s++) {
        cudaMemcpyAsync(out_s[s], out_h + (base + s) * inputSize,
                        inputSize * sizeof(cuComplex),
                        cudaMemcpyHostToDevice, streams[s]);
        cufftShift_2D_kernel<<<dimgrid, dimblock, 0, streams[s]>>>(in_d[s], out_s[s], x, y);
        cufftExecC2C(plans[s], in_d[s], out_d[s], CUFFT_INVERSE);
        cuifftShift_2D_kernel<<<dimgrid, dimblock, 0, streams[s]>>>(out_s[s], out_d[s], x, y);
        cudaMemcpyAsync(out_h + (base + s) * inputSize, out_s[s],
                        inputSize * sizeof(cuComplex),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < 4; s++)
        cudaStreamSynchronize(streams[s]);
}