


Computerized Medical Imaging and Graphics 51 (2016) 11–19

Contents lists available at ScienceDirect

Computerized Medical Imaging and Graphics

journal homepage: www.elsevier.com/locate/compmedimag

Multi-modal vertebrae recognition using Transformed Deep Convolution Network

Yunliang Cai a, Mark Landis b, David T. Laidley b, Anat Kornecki b, Andrea Lum b, Shuo Li a,b,∗

a Dept. of Medical Biophysics, Schulich School of Medicine and Dentistry, University of Western Ontario, 1151 Richmond St, London, ON, Canada
b Dept. of Medical Imaging, Schulich School of Medicine and Dentistry, University of Western Ontario, 1151 Richmond St, London, ON, Canada

a r t i c l e i n f o

Article history:
Received 31 July 2015
Received in revised form 24 January 2016
Accepted 29 February 2016

Keywords:
Vertebra detection
Vertebra recognition
Deep learning
Convolution network

a b s t r a c t

Automatic vertebra recognition, including the identification of vertebra locations and naming in multiple image modalities, is in high demand in spinal clinical diagnosis, where large amounts of imaging data from various modalities are frequently and interchangeably used. However, the recognition is challenging due to the variations of MR/CT appearance and of vertebra shape and pose. In this paper, we propose a method for multi-modal vertebra recognition using a novel deep learning architecture called the Transformed Deep Convolution Network (TDCN). This new architecture can fuse image features from different modalities in an unsupervised manner and automatically rectify the pose of the vertebrae. The fusion of MR and CT image features improves the discriminative power of the feature representation and enhances the invariance of the vertebra pattern, which allows us to automatically process images of different contrast, resolution, and protocols, even with different sizes and orientations. The feature fusion and pose rectification are naturally incorporated in a multi-layer deep learning network. Experimental results show that our method outperforms existing detection methods and provides fully automatic location + naming + pose recognition for routine clinical practice.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Magnetic resonance imaging (MR) and computed tomography (CT) are the two main imaging methods that are intensively and interchangeably used by spine physicians. Longitudinal and differential diagnoses today are often conducted on large MR/CT datasets, which makes manual identification of vertebrae a tedious and time-consuming task. An automatic locate-and-name system for spine MR/CT images that supports quantitative measurement is thus in high demand in orthopaedics, neurology, and oncology. Automatic vertebra recognition, particularly the identification of vertebra location, naming, and pose (orientation + scale), is a challenging problem in spine image analysis. The main difficulty arises from the high variability of image appearance due to image modalities or shape deformations of the vertebrae: (1) Vertebrae are difficult to detect across imaging modalities. The image resolution, contrast, and appearance of the same spine structure can be very different when it is imaged by MR or CT, or by T1- or T2-weighted MR sequences. (2) Vertebrae are difficult to name automatically. The vertebrae and intervertebral discs lack unique characteristic features, so automatic naming can fail easily. (3) Vertebra pose is difficult to estimate. The poses of vertebrae are highly diverse and few stable features can be used for pose estimation. Beyond the local pose and appearance problems, the global geometry of the spine is often difficult to recover in some medical situations, e.g., spine deformity and scoliosis. The reconstruction of global spine geometry from limited CT/MR slices can be ill-posed and requires sophisticated learning algorithms.

∗ Corresponding author at: Dept. of Medical Biophysics, Schulich School of Medicine and Dentistry, University of Western Ontario, 1151 Richmond St, London, ON, Canada. E-mail address: [email protected] (S. Li).

http://dx.doi.org/10.1016/j.compmedimag.2016.02.002
0895-6111/© 2016 Elsevier Ltd. All rights reserved.

Most current spine detection methods focus on identification of vertebra locations or labels in one particular image modality [1–5], and vertebra pose information is seldom obtained in the same method. (1) For vertebra localization, learning-based detectors were employed to handle specific image modalities; they were proven to work on CT (generalized Hough transform) [2], MR (AdaBoost) [3], or DXA images (random forests) [6]. Their training and testing were performed on the chosen image protocol only.

Some detection methods claimed to work on both MR and CT. Štern et al. [7] utilized the curved spinal geometric structure extracted from both modalities. Kelm et al. and Lootus et al. [8,9] used boosting-trained Haar features and SVM-trained Histogram



Fig. 1. The multi-modal recognition for lumbar spine imaging. The modalities are uniformly trained and detected in one unified recognition system. In this system, features from different modalities are fused and enhanced by each other via a deep network.

of Oriented Gradients (HOG), respectively. However, these cross-modality methods often required separate training for MR and CT, and thus separate testing for the two modalities as well. (2) For vertebra naming, [2–5] achieved successful labeling on fully or partially scanned image volumes. Their methods relied on the identification of special landmarks detected from multiple image views, i.e., axial view templates [2], spinal canals [5], or anchor vertebrae [3], while the exact labels are inferred by a probabilistic inference model, i.e., a graph model [10], a Hidden Markov Model (HMM) [4], or a hierarchical model [3,18]. (3) Besides detection and naming, vertebral pose is critical information in orthopedics. Pose estimation was used in [1,8,5] for extracting the 3D structure of the spine. These estimation methods exploited multi-planar detectors to match the correct vertebra poses, but cannot be used directly on a single-slice input. In addition, most training-based methods, as pointed out in [11], required dense manual annotations for ground-truth labels, i.e., annotations of all the corners and the center of each vertebra. This makes training-based methods inconvenient to use.

To overcome these limitations, we propose a unified framework using the Transformed Deep Convolution Network (TDCN) to provide automatic cross-modality vertebra location, naming, and pose estimation. As presented in Fig. 1, our system is a learning-based recognition system which contains a multi-step training stage and an efficient testing stage. Example results on MR and CT are shown in Fig. 2. The main ingredient of the system is a novel deep learning model [12] inspired by groupwise registration [13,14] and multi-modal feature fusion [15,16]. We make the following contributions in this paper:

• Vertebra recognition. The location, name, and pose (scale + orientation) of each vertebra are identified simultaneously. Further spine shape analysis, e.g., spondylolysis analysis, is then possible based on the recognition results.

Fig. 2. Examples of detection and naming results for both MR and CT images.


• Multi-modal feature learning. The vertebra features are jointly learned and fused from both MR and CT. This enhances feature discrimination and improves the classification of vertebra/non-vertebra.

• Invariant representation. In the training and recognition stages, the sampled or detected vertebrae are automatically aligned, generating transform-invariant feature representations and rectified final poses, respectively.

• Simple annotation. Thanks to the invariant representation, our method only requires a single click per vertebra for ground-truth annotation, while other methods [8,5,9] require four clicks or more.

2. The Transformed Deep Convolution Network

The Transformed Deep Convolution Network (TDCN) is a novel deep network structure which can automatically extract the most representative and invariant features for MR/CT. It employs MR–CT feature fusion to enhance feature discriminativity, and applies alignment transforms to the input data to generate invariant representations. This resolves the modality and pose variation problems in vertebra recognition. The overall structure of TDCN is presented in Fig. 3. The two major components of TDCN, the feature learning unit and the multi-modal transformed appearance learning unit, are presented in detail as follows.

2.1. Low-level feature learning unit

The low-level feature learning unit performs unsupervised learning of adaptive features that best represent the training MR/CT samples. The feature learning is implemented by layers of Convolutional Restricted Boltzmann Machines (CRBM) [12]. A CRBM is a multi-output filtering system which can adaptively update its filter weights to obtain the best approximative feature maps for the training samples. The learned features can reveal unique micro-structures





Fig. 3. The structure of the Transformed Deep Convolution Network (TDCN). Two different types of layers (CRBM and RBM) are used in the network. Note that the congealing component is applied in the training stage and will be bypassed in the testing stage.

in MR or CT, while hand-crafted features like SIFT or HOG cannot adapt to these micro-differences.

As shown in Fig. 3, there are two CRBM layers for feature extraction over MR patches and two for the extraction over CT patches. Each CRBM accepts a set of visible layers v as input, generating binary hidden layers h. In the training stage, the CRBM continuously tunes the weight coefficients W^k of the k ∈ {1, . . ., K} filters, in order to minimize the following energy functional

E(v, h) = − ∑_{k=1}^{K} h^k • (W^k ∗ v) − ∑_{k=1}^{K} b_k ∑_{i,j} h^k_{i,j} − c ∑_{i,j} v_{i,j}    (1)

where b_k and c are adjustable biases, and i, j range over [1, N_V − N_W + 1] for an N_V × N_V input layer v = [v_{i,j}] and an N_W × N_W filter W^k. The operator • is an element-wise product of two matrices, while ∗ is a convolution. The hidden layer generated by W^k is an N_H × N_H map h^k = [h^k_{i,j}] with N_H = N_V − N_W.

Two CRBMs are stacked together (levels 1 and 2 in Fig. 3), providing two-level feature extraction. Each CRBM first trains itself, unsupervised, until (1) is minimized, then connects its output (input) to the higher (lower) level CRBM. Let level 1 have K1 filters and level 2 have K2 filters; the output of the two-level feature learning is then K1 × K2 feature maps. For a given 2D filter W and bias b, the output of the CRBM can be defined by:

f(v) = σ(W ∗ v + b)    (2)

where σ is the element-wise logistic sigmoid function and v is in its vectorized form. Suppose f_1 is the output of the level 1 CRBM in Fig. 4; for the MR-CRBM (or CT-CRBM), we let f_1 be defined by (2), substituting the corresponding W with the K1 filters' coefficients, i.e., {W^1_MR, . . ., W^{K1}_MR} (or {W^1_CT, . . ., W^{K1}_CT}). Similarly, the level 2 CRBM output mapping f_2 in Fig. 4 is defined by substituting W in (2) with the corresponding K2 filters' coefficients.
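As a concrete illustration, the filtering of (2) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the names (crbm_forward, filters, biases) are our own, the filters are assumed already trained, and the sampling/pooling details of a full CRBM are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_forward(v, filters, biases):
    """Hidden maps h^k = sigmoid(W^k * v + b_k), as in Eq. (2), via 'valid' 2D convolution.

    v: (NV, NV) input patch; filters: (K, NW, NW) trained weights; biases: (K,).
    Returns (K, NH, NH) feature maps with NH = NV - NW + 1 ('valid' output size)."""
    K, NW, _ = filters.shape
    NV = v.shape[0]
    NH = NV - NW + 1
    h = np.empty((K, NH, NH))
    for k in range(K):
        Wk = filters[k][::-1, ::-1]  # flip the kernel so the sliding sum is a convolution
        for i in range(NH):
            for j in range(NH):
                h[k, i, j] = np.sum(Wk * v[i:i + NW, j:j + NW])
        h[k] = sigmoid(h[k] + biases[k])
    return h
```

Stacking two such layers, with the level 1 output feeding level 2, yields the K1 × K2 feature maps described above.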

Fig. 4. The fusion of MR and CT features using an RBM in TDCN (level 3).


2.2. Multi-modal transformed appearance learning unit

Multi-modal feature fusion is important in classification, as features missing in one modality can be compensated by the others, e.g., the vertebral discs missing in CT can be compensated by their MR counterparts. The features are also expected to be pose-invariant, as geometric invariance can enhance the discrimination of vertebra/non-vertebra. The multi-modal transformed appearance learning unit is a set of deep network layers specifically for fusing multi-modal features and transforming input samples to achieve invariant representations. With the fusion of low-level features from different modalities, the feature representation of vertebrae can be enhanced to obtain better discrimination.

MR/CT feature fusion using RBM. The feature maps extracted by the CRBMs are sent to two Restricted Boltzmann Machines (RBM) for multi-modal feature fusion. An RBM is a simpler version of a CRBM, which works with vectorized visible and hidden layers rather than the 2D layers of a CRBM. It can also reduce the dimensionality of the input by setting a shorter hidden layer. Suppose now v = [v_i] ∈ R^{N_V} is the visible layer, which can be either MR or CT features, and h = [h_j] ∈ R^{N_H} is the hidden layer; the functional the RBM minimizes is:

E(v, h) = − ∑_{i,j} v_i W_{ij} h_j − ∑_j b_j h_j − ∑_i c_i v_i    (3)

where W = [W_{ij}] is now an N_V × N_H matrix and b_j, c_i are adjustable biases. As shown in Fig. 4, the MR/CT feature maps obtained from the preceding low-level feature learning unit are vectorized and combined as the visible layer of the RBM, generating the MR–CT fused feature vector as the hidden layer of the RBM. The fusion is performed by the level 3 RBM in TDCN (see the level 3 layer in Fig. 3). The result of the level 3 RBM is a dimension-reduced unified feature vector that jointly describes the vertebra patterns in MR and CT.

Similar to (2), we can define the output mapping of the RBM as a nonlinear activation function:

f(v) = σ(W^T v + b)    (4)

where v and b are vectors rather than the 2D maps in (2). The mapping of the level 3 RBM in Fig. 3, namely f_3, is thus constructed by substituting the weight matrix and bias learned in training, i.e., W_3 and b_3, into (4). f_4 for the level 4 RBM is obtained in the same way.
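The fusion step of (4) amounts to concatenating the vectorized MR and CT feature maps into one visible layer and applying a single learned projection. A minimal sketch follows, under our own naming (rbm_fuse); W and b are assumed to come from RBM training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_fuse(mr_maps, ct_maps, W, b):
    """MR-CT fusion h = sigmoid(W^T v + b), as in Eq. (4): the visible layer v is
    the concatenation of the vectorized MR and CT feature maps; W is NV x NH."""
    v = np.concatenate([np.ravel(mr_maps), np.ravel(ct_maps)])  # unified visible layer
    assert W.shape[0] == v.size
    return sigmoid(W.T @ v + b)  # dimension-reduced fused feature vector
```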

The multi-modal fusion has a number of advantages that cannot be provided by single-modal feature learning (i.e., features learned only from levels 1 and 2 while bypassing levels 3 and 4 in TDCN): (1) MR (T1, T2) and CT features can compensate for each other in the fused feature vectors; (2) Shape features, which are rich in both



Fig. 5. Congealing for vertebra patches across MR and CT. Left to right: the original MR + CT patch set, the aligned patch set, and the means of the MR + CT patches during the first four iterations of the congealing computation.

modalities, are enhanced in the fused feature vector; (3) The fused feature vector can be used for identifying vertebra structures across

MR and CT using only one unified classifier.

Pose rectification by congealing. Pose rectification unifies the different orientations and sizes (scales) of the vertebrae in a spine scan, so that the resulting learned features are more representative of vertebrae. The vertebra appearance can be considered invariant, such that any vertebra patch can be assumed to be generated by warping a standard invariant patch with a certain rotation and rescaling. This warping uniquely describes the 2D geometric transform from a standard model to a sampled datum, which we call the pose of the vertebra. The pose rectification is performed by congealing.

Congealing is a groupwise registration over an image set. It is employed for aligning the 2D feature maps obtained from the level 3 RBM. Suppose v is an arbitrary vertebra image patch (MR or CT) with free-form size/orientation, and the inverse warping G^{−1} deforms v to an m × m regular patch, denoted v ◦ G^{−1}. Let {v_1, . . ., v_n} be the collection of patches; the corresponding warped versions are represented as {v_1 ◦ G_1^{−1}, . . ., v_n ◦ G_n^{−1}}. We require G = G(r, s) to be a similarity transform described by the rotation r and scaling s. The congealing of the patch collection can be formulated as the minimization of the energy:

C(r, s) = ∑_{i,j=1, i≠j}^{n} ||v_i ◦ G_i^{−1}(r_i, s_i) − v_j ◦ G_j^{−1}||².    (5)

As v can be either an MR or a CT patch, we cannot directly align these two types of patches in (5) unless they have a unified representation. This representation can be obtained by the feature fusion of the RBM (Fig. 4), where the fused feature mapping allows us to align images from different modalities in the same congealing model. By using the CRBM mappings f_1, f_2 and the RBM mapping f_3 obtained from activations (2) and (4) respectively, (5) can then be revised as

C(r, s) = ∑_{i,j=1, i≠j}^{n} ||f_3(vec(f_2(f_1(v_i ◦ G_i^{−1}(r_i, s_i))))) − f_3(vec(f_2(f_1(v_j ◦ G_j^{−1}))))||²,    (6)

where vec(·) denotes vectorization. f_1, f_2 extract the low-level MR/CT features of patch v, and f_3 maps them into a unified feature vector. The deep congealing (6) can not only align patches within either MR or CT, as the original congealing (5) does, but can also align MR and CT patches in the same patch set.
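To make the congealing idea concrete, the following toy sketch restricts the pose of each patch to a rotation by a multiple of 90° and minimizes the pairwise energy of (5) by coordinate descent (re-posing one patch at a time). This is an illustration only, with our own naming (congeal_energy, congeal_rot90): the paper uses full similarity transforms (rotation + scale) and the deep feature mapping of (6), both omitted here.

```python
import numpy as np

def congeal_energy(patches):
    """Pairwise SSD energy of Eq. (5) over a set of equally-sized patches."""
    return sum(np.sum((vi - vj) ** 2)
               for i, vi in enumerate(patches)
               for j, vj in enumerate(patches) if i != j)

def congeal_rot90(patches, sweeps=3):
    """Toy congealing: each patch's pose is a number of 90-degree rotations.
    Coordinate descent over one pose at a time never increases the group energy."""
    poses = [0] * len(patches)
    for _ in range(sweeps):
        for i in range(len(patches)):
            best_k, best_E = poses[i], None
            for k in range(4):
                trial_poses = poses[:i] + [k] + poses[i + 1:]
                E = congeal_energy([np.rot90(p, q)
                                    for p, q in zip(patches, trial_poses)])
                if best_E is None or E < best_E:
                    best_k, best_E = k, E
            poses[i] = best_k
    return poses
```

For example, given three copies of one patch rotated by different multiples of 90°, the recovered poses re-align them and drive the group energy to zero.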

Fig. 6. The overall training process of our multi-modal recognition system.


Fig. 5 illustrates a toy example of congealing over CT + MR images by minimizing (6). The vertebra patches in the three images are first mapped to unified feature maps using f_1, f_2 and f_3. The deep congealing (6) is then applied, obtaining the aligned patches. Note that the bounding boxes of the patches under congealing reveal the correct poses of the corresponding vertebrae. Therefore, from the congealing operation, we can rectify the poses of the vertebrae.

3. The training of the multi-modal recognition system

The training process of the TDCN system is presented in Fig. 6. The process starts with the annotation of sample patches in the original scans. It then trains the TDCN using the selected samples, generating invariant vertebra features. The features are used to train an SVM, yielding the desired vertebra classifier.

One-click sample annotation. The training samples (positive/negative) are collected by simple clicking operations in image slices. For positive samples, the user only needs to click the vertebra center in a training image slice, regardless of the size and orientation of the vertebra. This significantly simplifies the sample collection process of other methods (e.g., [8,9]), where at least four clicks are needed per vertebra. A coarse labeling of the vertebra type (T/L or S vertebra) is required after clicking, but no further specific labeling is needed. This also simplifies previous training-based systems, in which detailed labeling is often required. For negative samples, non-vertebra patches can simply be obtained by random sampling or clicking. The ratio of the number of vertebra patches to non-vertebra patches can be set around 1/10 to 1/8.

TDCN feature learning. The selected training samples serve as the input to the TDCN (Fig. 3) according to their original modalities. The TDCN then performs layer-wise learning via the optimization of the CRBM energy (1) and the RBM energy (3). In particular, learning starts from the lowest layer, updating its parameters until optimized. The parameters of that layer are then frozen, generating output and triggering the learning of the upper CRBM/RBM layer. The layer-wise update continues until all layers are optimized.
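The layer-wise schedule can be sketched as follows: train one layer until its parameters settle, freeze it, and feed its activations upward. This is a generic greedy pre-training sketch under our own naming (train_rbm_cd1, layerwise_train); it uses mean-field one-step contrastive divergence rather than the paper's exact training details, and shows only the fully-connected (RBM) case.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, lr=0.1, epochs=50):
    """Train one binary RBM (energy (3)) with one-step contrastive divergence."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)    # hidden bias
    c = np.zeros(n_visible)   # visible bias
    for _ in range(epochs):
        h0 = sigmoid(data @ W + b)   # positive phase
        v1 = sigmoid(h0 @ W.T + c)   # one-step reconstruction
        h1 = sigmoid(v1 @ W + b)     # negative phase
        W += lr * (data.T @ h0 - v1.T @ h1) / len(data)
        b += lr * (h0 - h1).mean(axis=0)
        c += lr * (data - v1).mean(axis=0)
    return W, b, c

def layerwise_train(data, layer_sizes):
    """Greedy layer-wise training: optimize, freeze, then feed activations upward."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b, _ = train_rbm_cd1(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)  # frozen layer's output becomes the next layer's input
    return layers
```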

Vertebra classifier learning. Thanks to the unified feature representation of MR and CT patches, we can train the vertebra/non-vertebra classification for both modalities simultaneously with an SVM. In our system, a standard classification SVM is




Fig. 7. The overall recognition process of our vertebra recognition system.



Fig. 8. Intermediate results of the recognition process. (a) The vertebra detection results; (b) pose post-rectification by congealing (three iterations are shown); (c) naming by label propagation from L5 and S1 along the spinal direction (red). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

trained for three classes according to the TDCN-trained features: T/L vertebra, S vertebra, and non-vertebra.

After the complete training process, the trained TDCN + SVM can be directly applied to blind recognition of both MR and CT patches. With a patch sampling process (e.g., a sliding window), the trained system can then be used for vertebra recognition on arbitrary MR/CT slices.

4. The vertebra recognition

We directly apply the trained multi-modal recognition system for fully automatic vertebra recognition on arbitrary spine images. The overall recognition process is shown in Fig. 7.

Vertebra detection. As shown in the first step of Fig. 7, to simulate the various poses of the vertebrae, we first rotate or rescale the input MR/CT image, generating a set of articulately transformed images. Regular patches (e.g., 51 × 51) are then randomly sampled from the images and sent as input to the trained TDCN and SVM. The locations that generate positive SVM outputs are marked as new vertebra positions. The detection result is represented as a set of bounding boxes, drawn as green boxes for L/T vertebrae and yellow boxes for S vertebrae in Fig. 8(a). A vertebra pose is now described by the box size and orientation.
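The detection loop can be sketched as follows. This is a toy sliding-window sketch under stated assumptions: the pose set is restricted to 90° rotations with nearest-neighbor rescaling, and `classify` is a hypothetical stand-in for the trained TDCN + SVM (any function mapping a patch to True/False).

```python
import numpy as np

def detect(image, classify, patch=51, stride=8, angles=(0, 1, 2, 3), scales=(1.0,)):
    """Slide a patch-x-patch window over rotated/rescaled copies of `image`
    and record (scale, angle, y, x) wherever `classify` fires."""
    hits = []
    for s in scales:
        H, W = int(image.shape[0] * s), int(image.shape[1] * s)
        ys = np.clip((np.arange(H) / s).astype(int), 0, image.shape[0] - 1)
        xs = np.clip((np.arange(W) / s).astype(int), 0, image.shape[1] - 1)
        scaled = image[ys][:, xs]         # nearest-neighbor rescale
        for a in angles:                  # toy pose set: 90-degree rotations only
            img = np.rot90(scaled, a)
            for y in range(0, img.shape[0] - patch + 1, stride):
                for x in range(0, img.shape[1] - patch + 1, stride):
                    if classify(img[y:y + patch, x:x + patch]):
                        hits.append((s, a, y, x))
    return hits
```

Each hit's (scale, angle) pair is a coarse pose hypothesis, which the congealing step then refines.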

Pose post-rectification. After obtaining the positive results from the SVM (second-to-last part of Fig. 7), we can rectify the poses of the bounding boxes using the same deep congealing mechanism (6). This is done by optimizing (6) with the positive patch set substituted as the input variables. The resulting aligned bounding boxes not only recover the correct poses of the detected vertebrae, but also merge duplicated bounding boxes near a location into one box. Fig. 8(b) illustrates the post-rectification of vertebra poses using deep congealing.

Vertebra naming. The specific vertebra names can be obtained using the pose-rectified bounding boxes (last part of Fig. 7). Since


the vertebra orientation has been correctly recovered by congealing, one can easily link the separated bounding boxes pairwise to form a spine structure using their orientations. The orientation-based linking also helps avoid connecting unrelated outlier bounding boxes from falsely detected locations. As shown in Fig. 8(c), when the bounding boxes are linked, we can easily identify the connecting part of the T/L and S vertebrae along the linking directions. The vertebrae L5 and S1 can then be identified from the connecting part. Starting with L5 and S1, and propagating the specific labels along the linked bounding boxes, we obtain labels for all detected vertebrae. The naming of MR and CT can be done simultaneously using the same workflow.
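The label-propagation step can be sketched as follows, for a chain of linked boxes ordered head-to-tail with coarse types 'TL' (thoracic/lumbar) or 'S' (sacral). This is our own illustrative helper (name_vertebrae), not the paper's implementation: it assumes the chain reaches the T/L–S junction, so the box just above the first sacral box is L5, and names propagate outward from L5/S1 as described.

```python
def name_vertebrae(types):
    """Assign names along a head-to-tail chain of coarse box types.

    The first 'S' box anchors S1; boxes above it count back L5, L4, ...,
    then T12, T11, ...; boxes below it count forward S2, S3, ..."""
    junction = types.index('S')  # the box just above this one is L5
    above = ['L5', 'L4', 'L3', 'L2', 'L1'] + ['T%d' % n for n in range(12, 0, -1)]
    names = []
    for i in range(len(types)):
        if i < junction:
            names.append(above[junction - 1 - i])  # count upward from L5
        else:
            names.append('S%d' % (i - junction + 1))  # count downward from S1
    return names
```

For example, `name_vertebrae(['TL', 'TL', 'TL', 'S', 'S'])` returns `['L3', 'L4', 'L5', 'S1', 'S2']`.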

5. Experiments

Our method is tested on a cross-modality MR + CT dataset which contains 60 MR volumes (T1 and T2 included) and 90 CT volumes from subjects with different pathologies (e.g., fracture, spondylolisthesis). In particular, 30 pairs of MR–CT lumbar volumes are from SpineWeb,1 50 CT volumes (lumbar and cervical) are from the MS Annotated Spine CT Database,2 and the remaining volumes (lumbar and whole spine) are from our supporting hospitals.

5.1. Experiment setting

Our TDCN system contains 2 CRBM layers and 2 RBM layers. The level 1 CRBM layer has a 7 × 7 × 6 filter bank for both the MR and CT modalities, and accepts image patches of size 51 × 51. The level 2 CRBM has a 9 × 9 × 10 filter bank. The resulting outputs of the low-level feature learning unit are 60-dim feature maps. For the RBM layers in the transformed appearance learning unit, the output of the level 3 RBM is an 800-dim feature vector, and that of the level 4 RBM is a 200-dim feature vector. The level 4 feature vectors are then used to train the SVM for vertebra/non-vertebra classification.

1 http://spineweb.digitalimaginggroup.ca.
2 http://research.microsoft.com/en-us/projects/spine/.




Fig. 9. Examples of vertebra detection on MR and CT slices from SpineWeb. The original slices, the detected patches, and the naming with pose estimations are shown.


For TDCN training, a set of 1150 patches (of size 51 × 51) is sampled from 6 selected MR and 4 CT volumes of our dataset. In particular, 115 patches from the set are manually selected and labeled as L/T vertebra patches by our one-click annotation. The remaining patches are non-vertebra patches randomly sampled near the vertebra areas of the same image slices. The proposed multi-modal recognition system is then trained using the collected patch set. The ground truth of the testing data is obtained by following the same selection process.

Our slice selection follows the standard radiology protocol of spine physicians, which reflects the common clinical scenario in spine diagnosis. Unlike the ground-truth labeling of spine segmentation problems, where manual labels in different slices affect the final evaluation, slice selection in our test cannot affect vertebra location selection/detection. The slice selection and vertebra ground-truth labeling are separate manual processes, and a degree of distortion is acceptable thanks to our robust detection. The vertebra location in a selected slice only depends on the success of extracting some vertebra features such as vertebra contours or vertebral discs. With our TDCN-trained feature extraction, the repetitive vertebra features from different sagittal slices can be reliably identified, and thus the location of the vertebra can still be accurately recovered.


5.2. Recognition experiments

Testing is conducted on 110 MR and CT sagittal slices (1–2 slices per volume) from the 90 MR–CT volumes, excluding all training volumes. The overall naming accuracy we obtained is 98.1% (MR) and 96% (CT), with localization error less than 3.81 mm.

Recognition test. Qualitative example results are presented in Figs. 9, 10, and 11, which show the versatility of our method under different imaging environments. Fig. 9 demonstrates the overall cross-modality detection on the SpineWeb dataset. Fig. 10 shows examples of detection and naming for the MS Spine database. The results in Fig. 11 illustrate our successful identification of location + naming + pose for all vertebrae in whole spine scans.

Our results provide a comprehensive numerical profile for the spine, which parameterizes its shape and structure identity. The numerical profile, in turn, refines the detection results so that non-vertebra structures near the spine can be excluded. The geometric parametrization not only enables an intuitive visualization of the spine curvature information, but also allows interactive deformations to incorporate further spine movements. With the parameterized model, our vertebra detection results can be immediately used for spine registration, pathology measurement, or bio-mechanical simulation purposes.

The quantitative results of the recognition test are evaluated by the standard error of the detected vertebra locations with respect to the ground-truth locations. The evaluations on specific vertebrae are



Fig. 10. Examples of vertebrae detection on CT slices from the MS Annotated Spine Dataset. In this test, the vertebrae C1–C2 are jointly treated as a special anchor vertebra, similar to S1, for the naming process of cervical scans.

shown in Table 1, while overall comparative results are presented in Table 2. Naming accuracy is above 95% for all data from both MR and CT. The naming errors that occurred in the results are due to the mis-identification of L5 and S1 in some slices. The detection errors are consistently low (around 3 mm) thanks to the post pose-rectification, which aligns the detected patches to refine their locations. A slight increase of detection error is observed in S1, as it cannot be aligned by its neighbor vertebrae due to its unique appearance. As shown in Tables 2 and 3, our detection results are

Fig. 11. Recognition results on whole spine MR and CT scans. The complete spine shape and vertebra labels are jointly identified.

competitive with the state-of-the-art detection methods for CT [4] and MR [17].

Additional quantitative comparative results are presented in Table 3, where we extend the lumbar detection to thoracic and cervical spine detection using the MS Annotated Spine CT Database. This provides a more detailed regional comparison with the methods of [4,11]. As the table shows, our method has the best accuracy and robustness; our error values are much smaller than those in [4,11]. Note that [4,11] consider the 3D detection scenario, which involves




Table 1
Recognition errors (detection and naming) of specific lumbar vertebrae on both MR and CT of the testing dataset (SpineWeb).

Vertebra                 S1     L5     L4     L3     L2     L1     T12    T11
MR detect. err. (mm)     5.12   3.23   2.29   3.61   2.98   3.1    3.05   2.87
MR naming acc. (%)       99.1   98.4   98.2   97.4   95.1   98.3   97.1   98.3
CT detect. err. (mm)     6.05   4.23   2.7    2.86   3.28   3.2    –      –
CT naming acc. (%)       98.9   97.1   97.9   96.5   95.0   96.3   –      –

Table 2
Detection errors (mm) of our recognition method and of the baselines in [17,11]. Average error values over all vertebrae are presented. Our method has the best labeling success rate and detection accuracy.

            Mean (MR)   Std (MR)   Mean (CT)    Std (CT)    Label acc. (MR/CT)
Ours        3.23        2.09       3.81         2.98        98.1%/96%
Baseline    3.1 [17]    2.1 [17]   10.6 [11]    16.9 [11]   97.8%/86% ([17]/[11])

Table 3
Comparative detection results (error, mm) for multiple spine regions (MS Annotated Spine Database). Our 2D detection provides a reliable and accurate solution for the recognition task.

           Regression + HMM [4]     Dense Prob. Label [11]   Ours
           Med    Mean   Std        Med    Mean   Std        Med    Mean   Std
Lumbar     5.4    9.7    11.2       7.6    11.5   14.1       3.5    3.0    3.4
Thoracic   5.5    9.9    10.8       8.7    12.4   11.6       3.1    3.8    2.9
Cervical   6.5    8.2    6.1        6.3    7.7    4.4        3.7    3.2    2.8

Table 4
Vertebra/non-vertebra classification results on the patch testing set, sorted by pose (rotation angle and scale factor).

Angle (°)         −5     12     18     −14    8
Scale             1.1    0.98   0.95   1.32   0.92
Sensitivity (%)   90.3   89.7   88.5   91.0   92.1
Specificity (%)   92.6   90.4   92.6   91.3   93.8

more spatial deformations/distortions and thus have larger error values. Instead of attempting to tackle the hard 3D problems, our 2D detection provides a reliable solution that can manage most of the detection tasks in common diagnosis cases. This reliability makes our method practically useful in most spine applications.

Besides lumbar scans, whole spine MR and CT slices are tested, as presented in Fig. 11, to show the generality of our method. Our system successfully recognizes all vertebrae, with C1 and C2 excluded due to their special appearances. In these examples, the same trained parameters as in the previous tests are used.

Classifier performance. To more accurately understand the role of the vertebra classifier in our TDCN + SVM recognition system, we specifically evaluate the performance of the SVM classifier in vertebra/non-vertebra classification tests. As shown in Table 4, the vertebra classifier has both high sensitivity (92.1% at best) and high specificity (93.8% at best). The testing set is formed by 1000 randomly sampled patches from the same slice set as the recognition test, with 300 positive and 700 negative patches. In addition, each patch centered at a sampled location is cropped with five different random poses, which leads to 5000 testing patches in total. The results (see Table 4) are obtained and sorted according to the poses. Our system consistently obtains high sensitivity (around 90%) and specificity (around 93%) in all poses, which makes it possible to apply the classifier to detect the whole spine in an image slice. The best classification (sens. 92.1%, spec. 93.8%) is achieved for the pose with a clockwise rotation of 8° and a resizing factor of 0.92. As can be observed in the detection results of Figs. 8 and 9, even without the pose post-rectifications, we can still rely on the SVM classifier to obtain general vertebra localizations with tolerable errors.
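The sensitivity and specificity figures in Table 4 follow the standard confusion-matrix definitions; a minimal sketch of the computation, where `y_true`/`y_pred` are hypothetical 0/1 arrays of ground-truth and predicted vertebra labels:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

With the 300 positive / 700 negative patch split reported above, sensitivity is computed over the 1500 positive patch poses and specificity over the 3500 negative ones.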

6. Discussion

2D vs. 3D methods. There are a number of 3D vertebra detection methods [4,11,18,19] that can provide vertebra identification in 3D space, particularly for CT images. However, as reported in [11,13], 3D detection is still very challenging, especially in some extreme pathological cases. Instead of competing on performance in 3D spine recognition, where large detection errors can occur, the method proposed in this paper focuses on 2D spine recognition and provides a fast, stable, and highly accurate solution for vertebra detection. The practical advantage of 2D detection/recognition is that it is fully compatible with the current clinical scenario in spine diagnosis, where 2D slices are still favored by spine physicians and are widespread in most existing medical reports and diagnostic documents. A reliable 2D detection tool is much easier to adopt in these practical situations. Also, as reported in Tables 2 and 3, the proposed 2D approach has better accuracy than the state-of-the-art 3D methods. This makes it more applicable to the measurement of spine pathological cases such as spondylosis and spondylolisthesis. For implementation, the 2D approach is faster than that proposed in [19], as only a single scan is required in the 2D detection.

Limitations. As discussed above, the proposed method is a 2D recognition approach, so a potential limitation is that the 2D image plane might not be sufficient to capture an irregular 3D spine shape. Some vertebrae could be missing or deformed in the image and thus cannot be correctly detected. A remedy for this problem is to extend the single-plane detection to multiple planes, where the adjacent slices of the sampled image are involved in the detection. To detect the adjacent slices, one can simply apply








Fig. 12. The missing vertebra problem of an irregular spine in an image. The missing vertebrae (T1–T3) of figure (a) can be detected in the additional slice (b). The combined results (c) provide the complete vertebra detection for the original scan.

the same detection. The missing vertebrae can then be identified through the same process. The duplicately detected vertebrae from different slices can be merged by their coordinates, and are linked together to form a single spine model similar to that described in Section 4. An illustrative example is shown in Fig. 12. In this example, T1, T2, and T3 are missing in the single-plane detection of the originally sampled image. By picking the adjacent slice from the same image volume (Fig. 12(b)), the missing vertebrae are identified, with the C7 vertebra duplicately detected. We then combine the detection results of both images and merge the duplicated vertebrae. The combined result can be processed by following the same vertebra naming procedure in Section 4.
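The coordinate-based merging of duplicate detections described above can be sketched as a simple de-duplication pass. This is an illustrative assumption, not the paper's code: `detections` is a hypothetical list of per-slice detections, each a `(name, y, x)` tuple, and the 10 mm distance threshold is an assumed value.

```python
def merge_slices(detections, dist_thresh=10.0):
    """Merge per-slice vertebra detections: duplicates of the same vertebra
    lying within dist_thresh of each other across adjacent sagittal slices
    are fused by averaging their in-plane coordinates."""
    merged = []  # list of [name, y, x]
    for slice_dets in detections:
        for name, y, x in slice_dets:
            for m in merged:
                if m[0] == name and ((m[1] - y) ** 2 + (m[2] - x) ** 2) ** 0.5 < dist_thresh:
                    m[1], m[2] = (m[1] + y) / 2, (m[2] + x) / 2  # fuse duplicate
                    break
            else:
                merged.append([name, y, x])
    return [(n, y, x) for n, y, x in merged]
```

In the Fig. 12 scenario, C7 detected in both slices would collapse to one entry, while T1–T3 found only in the adjacent slice would simply be appended to the spine model.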

7. Conclusion

In this paper, we proposed a multi-modal vertebra recognition framework using the Transformed Deep Convolution Network (TDCN). TDCN automatically extracts modality-adaptive, highly discriminative, and pose-invariant features for recognition. Using the TDCN-based recognition system, we can simultaneously identify the locations, labels, and poses of vertebra structures in both MR and CT. The system has successfully passed tests on multi-modal datasets for lumbar and whole spine scans with high accuracy and stability.

References

[1] Pekar V, Bystrov D, Heese HS, Dries SP, Schmidt S, Grewer R, et al. Automated planning of scan geometries in spine MRI scans. In: MICCAI. 2007.

[2] Klinder T, Ostermann J, Ehm M, Franz A, Kneser R, Lorenz C. Automated model-based vertebra detection, identification, and segmentation in CT images. Med Image Anal 2009;13(3):471–82.

[3] Zhan Y, Maneesh D, Harder M, Zhou XS. Robust MR spine detection using hierarchical learning and local articulated model. In: MICCAI. 2012.


[4] Glocker B, Feulner J, Criminisi A, Haynor DR, Konukoglu E. Automatic localization and identification of vertebrae in arbitrary field-of-view CT scans. In: MICCAI. 2012.

[5] Major D, Hladuvka J, Schulze F, Bühler K. Automated landmarking and labeling of fully and partially scanned spinal columns in CT images. Med Image Anal 2013;17(8):1151–63.

[6] Roberts MG, Cootes TF, Adams JE. Automatic location of vertebrae on DXA images using random forest regression. In: MICCAI. 2012.

[7] Stern D, Likar B, Pernus F, Vrtovec T. Automated detection of spinal centrelines, vertebral bodies and intervertebral discs in CT and MR images of lumbar spine. Phys Med Biol 2010;55(1):247.

[8] Kelm BM, Wels M, Zhou SK, Seifert S, Suehling M, Zheng Y, et al. Spine detection in CT and MR using iterated marginal space learning. Med Image Anal 2013;17:1283–92.

[9] Lootus M, Kadir T, Zisserman A. Vertebrae detection and labelling in lumbar MR images. In: MICCAI workshop: spine imaging. 2013.

[10] Alomari RS, Corso JJ, Chaudhary V. Labeling of lumbar discs using both pixel- and object-level features with a two-level probabilistic model. IEEE Trans Med Imaging 2011;30(1):1–10.

[11] Glocker B, Zikic D, Konukoglu E, Haynor DR, Criminisi A. Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In: MICCAI. 2013.

[12] Lee H, Grosse R, Ranganath R, Ng AY. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: ICML. 2009.

[13] Cai Y, Baciu G. Detecting, grouping, and structure inference for invariant repetitive patterns in images. IEEE Trans Image Process 2013;22(6):2343–55.

[14] Huang GB, Mattar M, Lee H, Learned-Miller EG. Learning to align from scratch. In: NIPS. 2012.

[15] Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY. Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11). 2011. p. 689–96.

[16] Srivastava N, Salakhutdinov R. Multimodal learning with deep Boltzmann machines. In: Advances in neural information processing systems. 2012. p. 2222–30.

[17] Oktay A, Akgul Y. Simultaneous localization of lumbar vertebrae and intervertebral discs with SVM-based MRF. IEEE Trans BME 2013;60(9):2375–83.

[18] Ma J, Lu L. Hierarchical segmentation and identification of thoracic vertebra using learning-based edge detection and coarse-to-fine deformable model. Comput Vis Image Underst 2013;117(9):1072–83.

[19] Cai Y, Osman S, Sharma M, Landis M, Li S. Multi-modality vertebra recognition in arbitrary views using 3D deformable hierarchical model. IEEE Trans Med Imaging 2015;34:1676–93.