Image Reconstruction on Multicore Processors Graduate Students Eric Fontaine and Viraj Paropkari Faculty Members: Ada Gavrilovska and Hsien-Hsin S. Lee


Page 1: Image Reconstruction on Multicore Processors

Image Reconstruction on Multicore Processors

Graduate Students: Eric Fontaine and Viraj Paropkari

Faculty Members: Ada Gavrilovska and Hsien-Hsin S. Lee

Page 2: Image Reconstruction on Multicore Processors

Agenda

• Background
• FDK Algorithm
  – Overview
  – Parallelization Method
  – Current Results
• Katsevich Algorithm
  – Overview
  – Parallelization Method
  – Current Results
• Future Plans

Page 3: Image Reconstruction on Multicore Processors

Background

• Use 3-D CT scan to identify tumors and other defects inside the body.
• Two common methods
  – MRI
    • Complex math and physics
    • Main function: simple IFFT
  – Filtered back-projection
• Two common filtered back-projection algorithms
  – FDK
    • Approximation, fast
    • Use projections taken on a circular path surrounding the object
    • More accurate on the plane containing the circle
  – Katsevich
    • More accurate, but also more compute-intensive
    • Use projections taken on a helical path surrounding the object
    • It can reconstruct long objects, unlike the original FDK.
• Both contain large data parallelism

Page 4: Image Reconstruction on Multicore Processors

FDK Algorithm Overview

• Cone beam image reconstruction with source on a helix for a flat detector
• Reconstruction for 3-D volume
• Initialize the helix source parameters
• Compute/load cone beam data
• Length correction weighting
• 1-D horizontal filtering
• Linear pre-interpolation
• Back projection
• Compare results with standard phantom
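For orientation, the length correction weighting step in FDK-type algorithms is commonly written as a cosine pre-weighting of every detector sample. The form below is the textbook one, not taken from the slides, and the symbols are assumptions introduced here: p is the measured projection at source angle β, (u, v) are flat-detector coordinates measured from the point where the central ray meets the detector, and D is the distance from the source to that detector plane.

```latex
\tilde{p}_{\beta}(u, v) \;=\; \frac{D}{\sqrt{D^{2} + u^{2} + v^{2}}}\; p_{\beta}(u, v)
```

The factor is the cosine of the angle between the ray through (u, v) and the central ray, so it equals 1 at the detector centre and decreases toward the edges. Whether the presented implementation uses exactly this form is not stated in the slides.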

Page 5: Image Reconstruction on Multicore Processors

Parallelization Strategy

• Based on the FDK algorithm for general scanning paths such as a helix.*
  – Each thread is assigned a subset of the total number of projections, and performs length correction weighting, filtering, and back-projection of its assigned projections.
  – After all threads are done, an implicit barrier synchronizes them. Then each thread is assigned a subset of the total volume to reconstruct.
  – We use OpenMP
    • Reconstruct subsets of the total volume in parallel (to fit into individual caches)
    • Piece the image together at the end (reduced inter-core communication)

[Figure: projections are assigned to threads; each thread performs length correction weighting, filtering, and back-projection; a barrier follows; the output is the reconstructed image.]

* Ge Wang, Tein-Hsiang Lin, Ping-chin Cheng, and Douglas M. Shinozaki. A general cone-beam reconstruction algorithm. IEEE Trans. on Medical Imaging, 12(3):486-496, September 1993.
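The two-phase structure described on this slide can be sketched with two OpenMP worksharing loops inside one parallel region. This is a minimal sketch, not the presented code: `preprocess_projection` and `backproject_slab` are hypothetical stand-ins with placeholder arithmetic, the volume is assumed to be split into equal contiguous slabs, and the sketch assumes one reading of the slide in which the per-projection phase does the weighting and filtering and the per-volume phase does the accumulation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for weighting + filtering of one projection.
   Placeholder arithmetic: scale every sample by 0.5. */
static void preprocess_projection(float *proj, size_t n_samples)
{
    for (size_t i = 0; i < n_samples; i++)
        proj[i] *= 0.5f;
}

/* Hypothetical stand-in for back-projection into one slab of the volume.
   Placeholder arithmetic: each voxel accumulates sample 0 of every projection. */
static void backproject_slab(float *slab, size_t n_voxels,
                             const float *projs, int n_proj, size_t n_samples)
{
    for (size_t v = 0; v < n_voxels; v++)
        for (int p = 0; p < n_proj; p++)
            slab[v] += projs[(size_t)p * n_samples];
}

/* Phase 1 splits the projections across threads; phase 2 splits the volume. */
void reconstruct(float *volume, int n_slabs, size_t voxels_per_slab,
                 float *projs, int n_proj, size_t n_samples)
{
    assert(volume && projs && n_slabs > 0 && n_proj > 0 && n_samples > 0);

    #pragma omp parallel
    {
        /* Phase 1: each thread handles a subset of the projections. */
        #pragma omp for schedule(static)
        for (int p = 0; p < n_proj; p++)
            preprocess_projection(projs + (size_t)p * n_samples, n_samples);
        /* The end of an "omp for" without nowait is an implicit barrier, so
           every projection is ready before any thread starts phase 2. */

        /* Phase 2: each thread reconstructs its own slabs. Slabs do not
           overlap, so no locking is needed on the volume. */
        #pragma omp for schedule(static)
        for (int s = 0; s < n_slabs; s++)
            backproject_slab(volume + (size_t)s * voxels_per_slab,
                             voxels_per_slab, projs, n_proj, n_samples);
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result, which makes it easy to compare single-thread and dual-thread timings from one source file.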

Page 6: Image Reconstruction on Multicore Processors

Single and Dual-Thread Performance

Performance (Seconds)

Volume          16^3    32^3    64^3    128^3   256^3
Single Thread   13      17      49      311     2416
Dual Thread     31      33      50      184     1263

Speedup of dual-thread OpenMP code

Volume          16^3    32^3    64^3    128^3   256^3
Speedup         0.419   0.515   0.98    1.69    1.912

Speedup values below 1 (16^3 through 64^3) are a slowdown.
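The speedup row is the single-thread time divided by the dual-thread time. A small helper makes the arithmetic explicit; the function names are illustrative, and the efficiency figure is an added convenience that the slides do not report.

```c
#include <assert.h>

/* Speedup of a parallel run over the single-thread run. */
double speedup(double t_single, double t_parallel)
{
    assert(t_single > 0.0 && t_parallel > 0.0);
    return t_single / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads. */
double efficiency(double t_single, double t_parallel, int n_threads)
{
    assert(n_threads > 0);
    return speedup(t_single, t_parallel) / n_threads;
}
```

With the 256^3 timings, `speedup(2416.0, 1263.0)` is about 1.913, and with the 16^3 timings, `speedup(13.0, 31.0)` is about 0.419, matching the reported values.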

Page 7: Image Reconstruction on Multicore Processors

FDK Analysis for Memory Behavior

Statistics of Single Thread

Volume       16^3    32^3    64^3    128^3   256^3
L1 Miss%     23.65   38.62   36.93   34.08   21.5
L2 Miss%     0.64    41.74   23.75   18.6    14.39
DTLB Miss%   6.67    9.95    12.93   12.5    14.79

Statistics of Two Threads

Volume       16^3    32^3    64^3    128^3   256^3
L1 Miss%     81.8    90.55   91.45   91.58   96.58
L2 Miss%     1.64    1.76    1.78    1.87    6.13
DTLB Miss%   10.27   12.05   11.67   12.86   11.71

Page 8: Image Reconstruction on Multicore Processors

Katsevich Algorithm Overview

• Reconstructs a 3-D cylindrical volume exactly from 2-D projections. [1]
  – The inputs are projections (b) taken from a helical path surrounding the volume of interest (a).
• Implemented the Noo method [2]:
  – The projections are differentiated and weighted appropriately (c).
  – They then undergo a 1-D Hilbert transform along the κ-lines.
    • First, remap to κ-line coordinates (d).
    • Perform 1-D convolution with the filter kernel (e).
    • Return to projection coordinates by remapping (f).
  – To reconstruct the 3-D volume (g), each voxel's coordinates are back-projected onto the source projections.
    • The cumulative sum is taken over all projections belonging to the PI-interval containing that voxel.
• Used a parallelization strategy similar to FDK
  – Each thread processes a subset of the projections.
  – After synchronization, each thread reconstructs a subset of the total volume.

[Figure: panels (a) through (g) illustrating the steps above.]

[1] Alexander Katsevich, "Theoretically exact FBP-type inversion algorithm for spiral CT," SIAM Journal on Applied Mathematics, 62:2012-2026, 2002.

[2] F. Noo, J. Pack, and D. Heuscher, "Exact helical reconstruction using native cone-beam geometries," Physics in Medicine and Biology, vol. 48, pp. 3787–3818, 2003.
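The 1-D convolution in step (e) is independent for every row of the remapped projection, which is what makes the filtering stage easy to split across threads. The sketch below shows only that row-wise convolution. The kernel is supplied by the caller; the specific Hilbert kernel and the κ-line remapping from [2] are not reproduced here, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Convolve one row of n samples with a kernel of odd length klen, centred on
   each output sample. Samples outside the row are treated as zero. */
void convolve_row(const float *in, float *out, int n,
                  const float *kernel, int klen)
{
    assert(n > 0 && klen > 0 && klen % 2 == 1);
    int half = klen / 2;
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int k = -half; k <= half; k++) {
            int j = i - k;                 /* out[i] = sum_k kernel[k] * in[i-k] */
            if (j >= 0 && j < n)
                acc += kernel[k + half] * in[j];
        }
        out[i] = acc;
    }
}

/* Filter every row of a rows x cols projection. Rows are independent, so the
   loop can be divided among threads without synchronization. */
void filter_projection(const float *in, float *out, int rows, int cols,
                       const float *kernel, int klen)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; r++)
        convolve_row(in + (size_t)r * cols, out + (size_t)r * cols,
                     cols, kernel, klen);
}
```

A direct convolution like this costs O(cols x klen) per row; an FFT-based version would be the usual choice for long kernels, at the price of extra buffers.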

Page 9: Image Reconstruction on Multicore Processors

Results

Speedup using 2 Threads (single-precision)

Width of Reconstructed Volume   64      128     256     512
Speedup over 1 thread           1.993   1.974   1.965   1.940

Reconstruction Time (single-precision), time in seconds

Width of Reconstructed Volume   64      128     256     512
1 Thread                        0.5     6.1     90.5    1454.6
2 Threads                       0.2     3.1     45.8    740.1

• Using Intel Core2 Duo @ 2.66 GHz.
• Close to 2x speedup

Reconstruction Time (double precision), time in seconds

Width of Reconstructed Volume   64      128     256     512
1 thread                        0.6     8.8     136.0   2133.0
2 threads                       0.3     4.4     68.9    1087.0

Speedup using 2 Threads (double precision)

Width of Reconstructed Volume   64      128     256     512
Speedup over 1 thread           1.984   1.980   1.974   1.962

Page 10: Image Reconstruction on Multicore Processors

Image Quality

[Figure: 512^3 reconstruction (512 projections per turn, 512x64 size projections)]

[Figure: 512^3 original phantom]

Page 11: Image Reconstruction on Multicore Processors

Benchmark

• Compared against the published timing results in [3], which used 64-bit AMD Opteron processors.
• Unable to determine the exact parameters used by the authors of [3], so the comparison may be questionable.

Comparison to Published Results in [3] (single-threaded, double precision), time in seconds

Width of Reconstructed Volume   128     256     512
my results                      9       136     2133
uiowa results                   226     975     6013

[3] Deng, J., Yu, H., Ni, J., He, T., Zhao, S., Wang, L., and Wang, G. 2006. A Parallel Implementation of the Katsevich Algorithm for 3-D CT Image Reconstruction. J. Supercomput. 38, 1 (Oct. 2006), 35-47.

Page 12: Image Reconstruction on Multicore Processors

Optimizations Used

• Majority of time spent during backprojection and determining the PI-intervals.
  – PI-intervals are constant for a particular helix.
• PI-intervals are precomputed and saved to a file.
  – Only necessary to precompute PI-intervals for one horizontal slice.
• PI-intervals for different horizontal slices can be determined by rotation.
  – Easy ~25% speedup

[Figure: Time Breakdown for Different Stages (initial version). Categories: Differentiation and Filtering, Determine PI-intervals, Perform Backprojection, Other.]
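The rotation trick follows from the symmetry of the helix: rotating a point about the axis by an angle Δ while raising it by the matching fraction of the pitch maps the helix onto itself, so the PI-interval of the moved point is the original interval shifted by Δ. The sketch below illustrates this under assumptions that are not in the slides: the precomputed table is stored on a polar grid (radius index, angle index) for one reference slice, the slices are spaced so that slice k corresponds to a rotation by exactly k angular steps, and the interval endpoints are helix angles in radians. All names are hypothetical.

```c
#include <assert.h>

#define TWO_PI 6.283185307179586

/* End points of a PI-interval, expressed as helix angles in radians. */
typedef struct {
    double s_bottom;
    double s_top;
} PiInterval;

/* table holds n_r * n_phi intervals for the reference slice (slice 0),
   stored row-major as table[ir * n_phi + iphi]. Returns the interval for the
   voxel at (ir, iphi) on the given slice; slice may be negative. */
PiInterval pi_interval_for_slice(const PiInterval *table, int n_r, int n_phi,
                                 int ir, int iphi, int slice)
{
    assert(table && n_r > 0 && n_phi > 0);
    assert(ir >= 0 && ir < n_r && iphi >= 0 && iphi < n_phi);

    /* Rotate the voxel back onto the reference slice, wrapping the angle
       index into [0, n_phi). */
    int src = ((iphi - slice) % n_phi + n_phi) % n_phi;

    /* Shift the stored interval forward by the same rotation angle. */
    double shift = slice * (TWO_PI / n_phi);
    PiInterval out = table[ir * n_phi + src];
    out.s_bottom += shift;
    out.s_top += shift;
    return out;
}
```

With this layout only one slice of intervals has to be computed and stored, and every other slice costs one index wrap and two additions per voxel. On a Cartesian voxel grid the rotated position generally falls between table entries, so some interpolation or recomputation would be needed there.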

Page 13: Image Reconstruction on Multicore Processors

Optimizations Used

• Next focused on backprojection inner loop.
  – Removed trivial lookup tables to save cache space.
    • ~10% speedup.
  – Used sin, cos lookup tables
    • ~15% speedup.
  – Moved if statements for smoothing the ends of the PI-interval outside the loop.
    • Duplicated inner loop code.
    • ~10% speedup.
  – Removed if statements required for bounds testing the backprojected coordinates.
    • Needed to add extra row and column slack to projection data.
    • ~3% speedup.
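The last optimization can be illustrated with a padded projection buffer. If every projection carries one extra row and one extra column of zeros, a bilinear fetch at the last valid coordinate can read its right and lower neighbours without a bounds test. This is a sketch under assumptions, not the presented code: the type and function names are hypothetical, and bilinear interpolation is assumed as the sampling method.

```c
#include <assert.h>
#include <stdlib.h>

/* A rows x cols projection stored with one extra row and column of slack. */
typedef struct {
    int rows, cols;   /* logical size */
    int stride;       /* allocated row length: cols + 1 */
    float *data;      /* (rows + 1) * (cols + 1) floats, slack zero-filled */
} PaddedProj;

/* Allocate a zero-filled padded projection; data is NULL on failure. */
PaddedProj padded_alloc(int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    PaddedProj p;
    p.rows = rows;
    p.cols = cols;
    p.stride = cols + 1;
    p.data = calloc((size_t)(rows + 1) * (size_t)(cols + 1), sizeof(float));
    return p;
}

/* Bilinear sample at column u, row v, for 0 <= u <= cols-1 and
   0 <= v <= rows-1. No per-neighbour bounds test: at the last row or
   column the extra neighbour is the zero slack, weighted by 0. */
float sample_bilinear(const PaddedProj *p, float u, float v)
{
    int iu = (int)u;
    int iv = (int)v;
    float fu = u - (float)iu;
    float fv = v - (float)iv;
    const float *q = p->data + (size_t)iv * p->stride + iu;
    float top = (1.0f - fu) * q[0] + fu * q[1];
    float bottom = (1.0f - fu) * q[p->stride] + fu * q[p->stride + 1];
    return (1.0f - fv) * top + fv * bottom;
}
```

The cost is a slightly larger buffer per projection. The caller still has to guarantee that the back-projected coordinates stay inside the logical range, which the sketch leaves to the surrounding loop.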

Page 14: Image Reconstruction on Multicore Processors

Future work

• Explore memory layout to reduce cache misses and page faults.
• Implement the same algorithms on the Cell processor for competitive analysis.