
Automatic Detection and Classification of Vertebral Fracture using Statistical Models of Appearance

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Medical and Human Sciences

2008

Martin G Roberts

School of Medicine

Contents

1 Introduction
1.1 The Problem
1.1.1 Aim
1.1.2 Context - Osteoporosis
1.1.3 Novel vertebral fracture detection methods
1.2 Overview of Thesis

2 Clinical Background
2.1 Osteoporosis
2.1.1 Epidemiology
2.1.2 Prevention and Treatment
2.1.3 Interpretation of Indicators of Osteoporosis
2.2 Measurement of Bone Mineral Density
2.2.1 Purpose
2.2.2 Single Photon Absorptiometry (SPA)
2.2.3 Dual Photon Absorptiometry (DPA)
2.2.4 Dual Energy X-ray Absorptiometry (DXA)
2.2.5 Quantitative Computed Tomography (QCT)
2.2.6 Quantitative Ultrasonography (QUS)
2.3 Measurement of Bone Structure and Integrity
2.4 Vertebral Fractures

3 Vertebral Fracture
3.1 Vertebrae and Vertebral Fractures
3.1.1 Significance
3.2 Imaging the Spine
3.2.1 Conventional Radiography
3.2.2 Imaging with DXA and SXA
3.2.3 Magnetic Resonance Imaging (MRI)
3.2.4 Sagittal Computed Tomography
3.3 Vertebral Fracture Identification
3.3.1 Semi-quantitative identification of vertebral fracture
3.3.2 Quantitative Morphometry
3.4 Conclusion

4 Model Based Vision
4.1 Introduction
4.2 Snakes
4.2.1 Deformable Elliptical Models
4.3 Elastic Models
4.4 Active Shape Model (ASM)
4.4.1 Summary of ASM
4.4.2 Point Distribution Models
4.4.3 Active Shape Model Search
4.5 Active Appearance Models
4.5.1 Background to the Active Appearance Model
4.5.2 Appearance Models
4.5.3 Fitting Appearance Models
4.5.4 Initialisation
4.5.5 Extensions to the AAM
4.5.6 Constrained AAM
4.6 Model Optimisation using Minimum Description Length
4.7 Conclusions

5 The consistent combination of multiple sub-model AAMs
5.1 Introduction
5.2 Model-Based Segmentation - Some Trade-Offs
5.2.1 Statistical Models in Medical Imaging
5.2.2 Global vs Local Models
5.3 Combining Overlapping Sub-Models
5.3.1 Vertebral Triplet Modelling
5.3.2 Generalisation
5.3.3 Dynamic Sub-Model Sequence Ordering Algorithm
5.3.4 Algorithm Pseudo-Code
5.3.5 Updating the Constraint Variance
5.3.6 Quality of Fit Measure
5.4 Conclusion

6 Vertebral Segmentation using Multiple AAMs
6.1 Introduction
6.2 ASM vs AAM
6.3 Data - DXA Images
6.3.1 Summary of Training Set
6.3.2 Shape annotation
6.3.3 Point correspondence
6.4 Initialisation
6.5 Experiments
6.5.1 Summary of optimal AAM determination
6.5.2 AAM form used
6.5.3 Initialisation Method and AAM profile length
6.5.4 Point constraint form
6.5.5 Optimisation of the sub-model structure
6.5.6 Multiple Initialisations for Fractured Vertebrae
6.5.7 Discussion
6.5.8 Conclusions

7 Vertebral Fracture Classification using Shape and Appearance Parameters
7.1 Introduction
7.2 Classification Methods
7.2.1 Data and Ground Truth
7.2.2 Linear Classifiers - Inputs and Training Scheme
7.3 Experiments
7.3.1 Initial APM form selection
7.4 Results
7.5 Discussion
7.6 Classifying given a semi-automatic segmentation
7.6.1 Semi-automatic method
7.6.2 Semi-automatic Results
7.7 Conclusions

8 Segmentation of vertebrae in radiographs
8.1 Introduction
8.2 Materials and Methods
8.2.1 Data
8.2.2 AAM approach
8.2.3 Experiments
8.3 Results
8.4 Discussion
8.4.1 Overall Accuracy Performance
8.4.2 Conclusion
8.4.3 Future Work

9 Conclusions and Further Work
9.1 Summary of Original Work and Results
9.1.1 AAM methodological developments
9.1.2 Vertebral Segmentation
9.1.3 Vertebral Classification
9.2 Future Work
9.2.1 Other modalities
9.2.2 Classifier improvements
9.2.3 Automatic Detection of Search Failure
9.3 Final Statement

A
A.1 Weighted fitting of shape and appearance model parameters
A.2 Optimal pose parameters
A.3 Weighted fitting of shape model parameters
A.4 Applying additional appearance model constraints

Word Count: 48996

List of Tables

6.1 Search error statistics (point-to-line) for 6mm profile gradient AAM
6.2 Search error statistics (point-to-line) for 6mm profile intensity AAM
6.3 Search error statistics (point-to-line) for 6mm profile renormalised intensity AAM
6.4 Search error statistics (point-to-line) for classical region intensity AAM
6.5 Search error statistics (point-to-line) for classical region intensity renormalised AAM
6.6 Search error statistics (point-to-line) for classical region intensity sigmoidal 2D gradient AAM
6.7 Search error statistics (point-to-line) for region corner feature AAM
6.8 Search error statistics (point-to-line) for 6 step profile gradient AAM
6.11 Search error statistics using full covariance matrix for point constraints
6.12 Search error statistics (point-to-line) for single vertebra sub-models
6.13 Search error statistics (point-to-line) for semi-triplet sub-models
6.14 Search error statistics (point-to-line) for quintet sub-models
6.15 Search error statistics (point-to-line) for single global model
6.16 Shape Model Intrinsic Accuracy for Triplet sub-models
6.17 Shape Model Intrinsic Accuracy for Quintet sub-models
6.18 Shape Model intrinsic accuracy for a single global model
6.19 Search error statistics using alternative fractured initialisations
6.20 Accuracy and Precision by individual vertebrae
7.1 Beta-convolved false positive rates (%) for the gradient appearance classifier as a function of variance retained in texture model
7.2 Beta-convolved false positive rates (%) for the intensity appearance classifier as a function of variance retained in texture model
7.3 Area under ROC curves
7.4 False Positive Rates (%) in the mid-thoracic spine (T9-T7) for the different classifiers at various sensitivities
7.5 False Positive Rates (%) in the lower-thoracic spine (T12-T10) for the different classifiers at various sensitivities
7.6 False Positive Rates (%) in the lumbar spine for the different classifiers at various sensitivities
7.7 McNemar Test Statistic comparing FPR for various classifiers between 93% and 97% sensitivity
7.8 Overall Patient-Level FPR and Sensitivity given individual vertebrae FPR
7.9 Classifier Sensitivities for 1%, 2% and 5% FPR, for semi-automatic segmentation
7.10 Area under ROC curves given semi-automatic segmentation
7.11 Overall Patient-Level FPR and Sensitivity given individual vertebrae FPR
8.1 Search Accuracy Percentiles by Fracture Status for the two profile samplers used

List of Figures

2.1 The microstructure of normal (left) and osteoporotic (right) bone.
2.2 The trabecular structure of normal (left) and osteoporotic (right) vertebrae.
2.3 The variation in fracture incidence rate with age for women. Taken from [132]
3.1 The lateral anatomy of a vertebra.
3.2 The spinal column, showing the numbered cervical, thoracic, and lumbar vertebrae
3.3 Examples of spinal radiographs
3.4 This radiograph shows an osteoporotic spine with numerous severe fractures.
3.5 The projection effect of lateral radiography on the spine.
3.6 Examples of parallax effects in radiographs
3.7 Examples of DXA images
3.8 Examples of vertebral fractures in DXA images
3.9 Appearance of vertebrae on a T1-weighted sagittal slice MRI image of the thoracic spine (T11-T9)
3.10 Examples of non-fracture vertebral deformities
3.11 The Genant semi-quantitative grading system
4.1 Spine shape model variation mode 1
4.2 Spine shape model variation mode 2
4.3 Spine appearance model variation mode 1
4.7 L1 triplet appearance model variation mode 1
4.8 L1 triplet appearance model variation mode 1
4.9 L1 triplet profile gradient appearance model variation mode 1
4.10 L1 triplet profile gradient appearance model variation mode 2
4.11 Face corner feature appearance model - mode 1 variation
5.1 Sub-model combination example - two iterations of vertebral triplets
6.1 DXA image with superimposed shape annotation
6.2 More examples of DXA images with vertebral fractures and superimposed shape annotation
6.3 Zoomed-in view of individual vertebral shape points
6.4 Fractured vertebral shape annotation example
6.8 AAM search failure example with severe fracture
6.9 Example of large global contrast variation
6.10 Mean point-to-line errors (mm) by vertebral fracture grade, comparing quintet sub-model AAMs to a global AAM
7.1 Mid-Thoracic Spine (T7-T9) ROC Curves showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants
7.2 Lower-Thoracic Spine (T10-T12) ROC Curves showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants
7.3 Lumbar Spine ROC Curves showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants
7.4 ROC Curves for combined Grade 1 Fractures showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants
7.5 ROC Curves for combined Grade 2 Fractures showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants
7.6 Visualisation of the (scale-free) discriminant direction in shape parameter space
7.7 Visualisation of the (scale-free) discriminant direction in appearance parameter space
7.8 ROC curves for (semi)automatically-segmented images, with all vertebrae combined
7.9 ROC curves for appearance classifier on (semi)automatically-segmented images, for the 3 fracture grades
8.1 Lumbar radiograph. a) shows the raw image (contrast enhanced); b) shows the automatically located vertebral contours superimposed.
8.2 Zoomed in view of L3 showing its shape model points
8.3 L2 triplet 3SD variation in first (left) and second (right) shape modes

Glossary

AAM: Active Appearance Model
ABQ: Algorithmically Based Qualitative method of vertebral fracture diagnosis
APM: Appearance Model
ASM: Active Shape Model
BP: Bisphosphonate
BMD: Bone Mineral Density
CoV: Coefficient of Variation, i.e. precision SD as percentage of Mean.
CDF: Cumulative Density Function (integral of PDF)
DXA: Dual Energy X-ray Absorptiometry
FPR: False Positive Rate
HRT: Hormone Replacement Therapy
LD: Linear Discriminant
MAD: Median Absolute Deviation
MRI: Magnetic Resonance Imaging
PCA: Principal Components Analysis
PDF: Probability Density Function
PDM: Point Distribution Model
QCT: Quantitative Computed Tomography
QM: Quantitative Morphometric method of vertebral fracture diagnosis
QUS: Quantitative Ultrasonography
ROC: Receiver Operating Characteristic
SD: Standard Deviation
SQ: Semi-Quantitative method of vertebral fracture diagnosis
SVD: Singular Value Decomposition (i.e. of matrix)
SVH: Short Vertebral Height (i.e. a vertebral deformity)
SXA: Single Energy X-ray Absorptiometry
WHO: World Health Organisation

Abstract

Vertebral fractures are an important diagnostic feature for osteoporosis. However, existing expert diagnosis from radiological images is rather subjective, whilst current quantitative methods require time-consuming hand annotation, and then lack specificity. We develop methods from Computer Vision (the Active Appearance Model) to provide a semi-automatic segmentation method to locate the full outlines of the vertebral bodies. We split the spine up into a number of overlapping sub-models, and develop a novel approach to combining multiple sub-models into a consistent overall fit. The accuracy of these methods is shown to be superior to using a single global model - especially in the more difficult fractured cases. Mean segmentation accuracy is comparable to manual precision, and is of the order of 0.75mm for normal vertebrae and 1mm for fractured vertebrae, although accuracy can deteriorate for severe fractures. The method was applied to both lateral dual energy X-ray absorptiometry (DXA) scans, and digitised lumbar radiographs.

We develop novel fracture classification methods using the parameters of both shape and appearance models. Linear discriminants are trained using a consensus expert reading by two radiologists as the gold standard. The classifier performance is evaluated on unseen DXA images. By using the appearance parameters, the false positive rates are reduced substantially compared to conventional 3-height morphometry. At 95% sensitivity the appearance model classifiers give an overall false positive rate of under 5%, compared to 18% with conventional morphometric methods.

Institution: The University of Manchester
Candidate: Martin G Roberts
Degree Title: Doctor of Philosophy
Thesis Title: Automatic Detection and Classification of Vertebral Fracture using Statistical Models of Appearance
Date: 2nd May 2008

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright Statement

1. The author of this thesis (including any appendices and/or schedules to this thesis) owns any copyright in it (the “Copyright”) and he has given the University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

2. Copies of this thesis, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

3. The ownership of any patents, designs, trade marks, and any and all other intellectual property rights except for the Copyright (the “Intellectual Property Rights”) and any reproductions of copyright works, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

4. Further information on the conditions under which disclosure, publication and exploitation of this thesis, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of School of Medicine (or the Vice-President).

Acknowledgements

I would like to thank my supervisors Prof. Tim Cootes and Prof. Judith Adams for their guidance, support and enthusiasm during my research.

I also thank Stephen Capener for annotation of the more recent data, and all who contribute C++ code to the VXL library, which I have used extensively, and in particular Prof. Tim Cootes for all his APM and AAM source code, and Dr Ian Scott for his linear classifier training and instantiation classes, and for code for hierarchical bootstrapped confidence intervals.

The radiologists who classified the DXA images were Prof. JE Adams (JEA) and Dr Elisa Pacheco (EP). I also thank Professor Cyrus Cooper for his permission to use a set of radiographs, previously obtained in an epidemiological study under his supervision [23].

Finally I thank both the Research Endowment in Central Manchester and Manchester Children’s University Hospitals NHS Trust∗ for providing initial funding for the project, and the Arthritis Research Council (ARC) for providing current funding.

∗CMMC account 9504


About the Author

Martin Roberts has had a varied background. He graduated from Cambridge in 1980, having read Mathematics and Theoretical Physics. He turned his mind from the mind-bending world of relativistic quantum mechanics to somewhat more practical mathematical applications by next taking a Masters degree at Lancaster in Operational Research (OR). After this he joined the then Scicon Consultancy (now EDS-Scicon) in London, working for five years in Operational Analysis simulations, and tracker development for the Royal Navy. He next became more of a software engineer, but specialising in algorithmic applications such as process monitoring (fault detection), manufacturing control, and air traffic control tools (e.g. aircraft conflict detection). His subsequent specialisation was sonar tracking in naval applications and the use of sonar in naval mine hunting operations. This was interspersed with a good deal of time exploring the Indian Himalaya. He also spent 3 years lecturing OR algorithms in the Mathematics Department of the University of Central Lancashire, with a collaborative research interest in algorithms for predicting (from the genetic sequence) which protein segments are likely to fold into surface-active α-helices, with application in penicillin binding proteins†. He joined the Division of Imaging Science and Biomedical Engineering (ISBE) at the University of Manchester as a Research Associate in 2003.

He now lives in Halifax, has two daughters, and enjoys all forms of mountaineering, on rock, ice, and mixed routes. He even likes climbing Yorkshire gritstone! He lead-climbs at about Very Severe grade on rock, or Scottish grade III/4 on mixed winter routes, and also enjoys skiing steep couloirs.

†with collaborators Dr DA Phoenix and Dr A Pewsey

Publications since joining ISBE

Immediately prior to registering for a PhD he published the following paper which provided a basis for some of the work in this thesis.

• Roberts MG, Cootes TF, and Adams JE. Linking sequences of active appearance sub-models via constraints: an application in automated vertebral morphometry. In: 14th British Machine Vision Conference, (pages 349–358) 2003.

After registering for a PhD he published the following papers related to the work in this thesis.

• Roberts MG, Cootes TF, and Adams JE. Vertebral shape: Automatic measurement with dynamically sequenced active appearance models. In: 8th MICCAI Conference, vol. 2, (pages 733–740). 2005.

• Roberts MG, Cootes TF, and Adams JE. Automatic segmentation of lumbar vertebrae on digitised radiographs using linked active appearance models. In: Graham J, Thacker N, and Cootes T, eds., Medical Image Understanding and Analysis Conference, (pages 120–124) (BMVA), 2006.

• Roberts MG, Cootes TF, and Adams JE. Improving the segmentation accuracy of fractured vertebrae with dynamically sequenced active appearance models. In: 9th MICCAI Conference - Workshop on joint and bone disease, (pages 1–8). 2006.

• Roberts MG, Cootes TF, and Adams JE. Vertebral morphometry: semiautomatic determination of detailed shape from DXA images using active appearance models. Investigative Radiology, 41(12):849–859, 2006.

• Roberts MG, Cootes TF, Pacheco EM, and Adams JE. Quantitative vertebral fracture detection on DXA images using shape and appearance models. Academic Radiology, 14:1166–1178, 2007.

Chapter 1

Introduction

1.1 The Problem

1.1.1 Aim

The aim of this thesis is to investigate the use of computer vision techniques to detect and quantify vertebral fractures due to osteoporosis. Osteoporosis is a progressive skeletal disease characterised by low bone mass and structural deterioration of bone tissue, leading to bone fragility and an increased susceptibility to fractures, especially of the hip, spine and wrist. Early detection of the condition can allow preventative or therapeutic intervention.

This thesis describes investigations into locating and classifying vertebrae using statistical models of shape and appearance, with the overall aim of improving the effectiveness of current methods of osteoporosis diagnosis.

1.1.2 Context - Osteoporosis

Osteoporosis is one of the most important diseases facing the elderly, and increasing life expectancy is making it a serious public health problem. The estimated lifetime risk, at the age of 50, of sustaining an osteoporotic fracture in the U.S. is 39.7% for women and 13.1% for men [101]. By the age of 80, 70% of U.S. women are osteoporotic [99].

The financial cost of osteoporosis is increasing rapidly. In the EU, osteoporosis now costs more than 4.8 billion Euros annually in hospital healthcare alone, a 33% increase over three years [131], whilst in England and Wales, the total direct hospital cost of osteoporotic fractures in 1999 was £584 million [131].

Postmenopausal osteoporosis is a significant cause of morbidity and mortality amongst the elderly in the Western world, leading to large numbers of fractures of the hip, spine and wrist. Hip fractures are the most serious and painful: 27% of women who sustain a hip fracture die within 1 year [102]. In the U.S., the estimated lifetime risk of hip fracture is 17.5% for women and 6.0% for men [101]. In Europe, in 2000, the number of osteoporotic fractures was estimated at 3.79 million, of which 0.89 million were hip fractures [80]. These figures are predicted to increase, due to increasing life expectancy.

Half of all osteoporotic fractures are vertebral: a 50 year old woman has a one in four chance of having such a fracture, and a 50 year old man about half that risk [64]. Vertebral fractures tend to occur about two decades earlier than hip and other osteoporotic fractures, and are often the first clinical sign of osteoporosis. The presence of even one vertebral fracture increases the risk of any subsequent vertebral fracture five-fold [100], and doubles the risk of a subsequent hip fracture [9]. Proven therapies are available for patients with vertebral fractures, which reduce the incidence of subsequent fractures by 50% or more [131]. All such patients need treatment, as the risk of further fractures is high, around 20% in the 12 months following a recent vertebral fracture. Thus early diagnosis of vertebral fracture is important. Furthermore, in trials of new treatments for osteoporosis, incident vertebral fracture statistics are studied, and used as a measure of efficacy. This provides another reason for increasing the reliability and efficiency of vertebral fracture diagnosis.

1.1.3 Novel vertebral fracture detection methods

Currently there are quantitative approaches to diagnosing vertebral fracture, but these rely on time-consuming and imprecise annotation of vertebrae, in order to extract morphometric height information. Typically each vertebra is characterised by three heights (posterior, anterior and middle), and either the heights or various height ratios are thresholded. Such approaches are sometimes referred to as 3-height morphometry. These current quantitative methods lack specificity (see Chapter 3 for a thorough discussion), as well as being time-consuming; whereas a widely accepted method of semi-quantitative expert reading suffers from subjectivity, and is less suitable when practised by readers other than skilled radiologists (e.g. other medical specialists, general practitioners or specialised radiographers). Therefore we have developed more specific and reliable quantitative methods applied to spinal images acquired by dual energy x-ray absorptiometry (DXA) scans. We have also successfully applied the segmentation phase of these to spinal radiographs.
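As a rough illustration of the 3-height morphometry described above, the following Python sketch forms two height ratios from the anterior, middle and posterior heights of a vertebral body and thresholds them. The class and function names, the particular ratios chosen, and the 0.8 cut-off are illustrative assumptions rather than values taken from this thesis; practical schemes usually also compare heights against reference values from neighbouring vertebrae or a normal population.

```python
from dataclasses import dataclass


@dataclass
class VertebraHeights:
    """Three heights of one vertebral body, in mm (hypothetical container)."""
    anterior: float
    middle: float
    posterior: float


def height_ratios(v: VertebraHeights) -> dict:
    """Return two illustrative height ratios, both relative to posterior height."""
    return {
        "wedge": v.anterior / v.posterior,     # low value suggests anterior wedging
        "biconcave": v.middle / v.posterior,   # low value suggests endplate collapse
    }


def flag_deformity(v: VertebraHeights, threshold: float = 0.8) -> bool:
    """Flag the vertebra if any ratio falls below the (assumed) threshold."""
    return any(ratio < threshold for ratio in height_ratios(v).values())


# Example usage with made-up measurements.
normal = VertebraHeights(anterior=20.0, middle=19.5, posterior=21.0)
wedged = VertebraHeights(anterior=14.0, middle=18.0, posterior=21.0)
print(flag_deformity(normal), flag_deformity(wedged))
```

Because only three numbers per vertebra are used, any shape change that leaves these ratios above the threshold goes undetected, and any non-fracture deformity that lowers them is flagged, which is one way to see why such methods can lack specificity.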

The first stage is to automatically and accurately segment the vertebral bodies. We use methods from Computer Vision (the Active Appearance Model), and develop these methods to accurately locate the full outlines of the vertebral bodies. To do this we have split the spine up into a number of overlapping sub-models. We have developed a novel approach to combining multiple sub-models into a consistent overall fit. We assess the accuracy of these methods, comparing several different sub-model structures, and show our approach is superior to using a single global model - especially in the more difficult fractured cases.

We have developed novel and more specific fracture classification methods using the

parameters of both shape and appearance models (see Chapter 4). Linear classi-

fiers are trained, and their performance evaluated on unseen images using miss-1-out

tests. By using the appearance model parameters, the false positive rates are reduced

substantially compared to existing quantitative methods (3-height morphometry).
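The miss-1-out (leave-one-out) evaluation of a linear classifier can be sketched as follows. This is a sketch under stated assumptions, not the classifier used in this thesis: it uses Fisher's linear discriminant with equal class priors as a stand-in, and the feature vectors are synthetic Gaussian data standing in for model parameters. All function names are hypothetical.

```python
import numpy as np


def fit_lda(X, y, ridge=1e-6):
    """Fit a two-class Fisher linear discriminant; returns (weights, bias)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance, lightly regularised so it is invertible.
    S = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    S = S / (len(X) - 2) + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    b = -0.5 * w @ (m0 + m1)  # boundary midway between class means (equal priors)
    return w, b


def leave_one_out_accuracy(X, y):
    """Train on all samples but one, test on the held-out sample, and repeat."""
    correct = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        w, b = fit_lda(X[keep], y[keep])
        predicted = int(X[i] @ w + b > 0)
        correct += int(predicted == y[i])
    return correct / len(X)


# Synthetic stand-in for model parameters: two well-separated Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(3.0, 1.0, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
accuracy = leave_one_out_accuracy(X, y)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Each sample is held out individually here. When several vertebrae come from the same image, holding out all of that image's vertebrae together would be the analogous per-image test.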

1.2 Overview of Thesis

Chapter 2 describes osteoporosis, how it is detected, measured and treated, in order

to provide the context of the project.

Chapter 3 continues the clinical background with a critical review of current meth-

ods of diagnosing vertebral fracture.

Chapter 4 consists of a literature review of computer vision techniques for robustly

segmenting structures in medical images, leading up to the Active Appearance Model

of Cootes et al which we subsequently use.

Chapter 5 introduces our development of the AAM methodology to allow the consistent

combination of multiple overlapping sub-model AAMs which are fitted in a

data-dependent sequence. We use a constrained form of the AAM in order to incorporate

the linkage between the sub-models. The linking and sub-model sequencing

algorithm is described in general terms.

Chapter 6 next describes our use of multiple but overlapping Active Appearance

Models to accurately segment vertebrae in DXA images, using the multi-AAM ap-

proach described in Chapter 5. We present results on our DXA dataset and optimise

the AAM form and sub-structures used. We also analyse some of the failure cases

(often these are severe fractures), and propose methods of improving the performance

in difficult cases.

Chapter 7 presents our novel methods of vertebral fracture detection, using linear

classifiers trained on both shape and appearance parameters. The latter encode useful

information about texture around the endplate, and it is shown that this, in addition

to a more complete and subtle shape description, leads to a marked improvement in

specificity compared to existing quantitative (morphometric) methods.

Chapter 8 presents related results on segmentation accuracy of our methods applied

to lumbar radiographs.

Chapter 9 draws some conclusions from the work and outlines areas requiring future

development.

Chapter 2

Clinical Background

2.1 Osteoporosis

Osteoporosis is a progressive skeletal disease characterised by low bone mass and

structural deterioration of bone tissue, leading to bone fragility and an increased

susceptibility to fractures, especially of the hip, spine and wrist. Osteoporosis has

many causes, the most common of which is a deficiency in oestrogen production after

the menopause in women. This deficiency causes a loss of both cancellous (trabec-

ular or spongy) and cortical (compact) bone. Most bones comprise a hard cortical

shell, within which is a fine net of trabeculae (strands), which improve bone strength

whilst adding little weight. The loss of trabecular bone is known as postmenopausal,

or ‘Type I’ osteoporosis. This usually begins after the menopause in women between

the ages of 55-65 and gives rise to fractures in skeletal sites that are rich in tra-

becular bone: especially vertebral and wrist fractures. A decrease in cortical bone,

which occurs approximately 15 years later in life, is known as senile, or ‘Type II’

osteoporosis, occurs in both men and women as age advances, and leads to fractures

of the hip. Poor diet can also contribute to osteoporosis - for example, deficiencies

in calcium, protein and vitamins C and D adversely affect bone health. Many drug

therapies, such as anticoagulants, glucocorticoids, and hormones used in therapeutic

doses, have side-effects which accelerate bone loss. Osteoporosis can also be caused

by a wide variety of other conditions that affect the remodelling of bone, such as

abnormalities in the endocrine system (e.g. hyperparathyroidism, hyperadrenalism

and hypogonadism). Osteoporosis can even appear in childhood (osteogenesis imperfecta),

due to rare inherited forms of the disease (abnormal Type I collagen) which

result in poor bone formation. Maintenance of bone mass requires regular loading

of the bone, which encourages bone development and remodelling. Individuals who

are less active are therefore more likely to become osteoporotic. Almost any chronic

illness can lead to bone loss, with inactivity and malnutrition being major factors.

Figure 2.1: The microstructure of normal (left) and osteoporotic (right) bone.

The mechanism for osteoporotic bone loss is complex and only partially understood.

Bone mass is maintained by osteoblasts, which form bone, and osteoclasts, which

resorb bone from the skeleton. Trabecular bone, which has the highest surface area,

and largest metabolic activity, is remodelled at a greater rate than cortical bone, and

is therefore lost more rapidly from the skeleton when there is an imbalance between

bone formation and resorption. Trabeculae are reduced in number in osteoporotic

bone and the spacings between trabeculae are greater. Hence the bone becomes me-

chanically weaker. Figure 2.1 shows the difference in micro-architecture between

normal and osteoporotic bone, whilst Figure 2.2 shows the change in the microstruc-

ture of trabecular vertebral bone which occurs with osteoporosis.

Bone mass (in healthy individuals) increases from childhood until the early 20’s,

remains static up to age 40-50 years, after which it declines. In men, this decline is

fairly gradual, but in women, the decline is particularly rapid immediately after the

menopause. This decline in bone mass with age is reflected in an increase in the rate

of osteoporotic fractures in the elderly population.

Figure 2.2: The trabecular structure of normal (left) and osteoporotic (right) vertebrae.

2.1.1 Epidemiology

Postmenopausal osteoporosis is a significant cause of morbidity and mortality amongst

the elderly in the Western world, leading to large numbers of fractures of the hip,

spine and wrist. The number of osteoporotic fractures in the U.K. has been estimated

at 200 thousand per annum [49].

As the elderly population grows, due to advances in healthcare and demographic

changes, so the proportion of women (and men) suffering from the disease will con-

tinue to increase. Other trends have meant that the number of osteoporosis sufferers

has increased at an even greater rate than that expected from demographic changes

[2].

Of all fractures due to osteoporosis, hip fractures are the most serious and painful.

Approximately 27% of women who sustain a hip fracture die within 1 year [102],

while half will suffer long-term pain and disability [50]. The incidence of hip fractures

increases dramatically with age, as not only does the bone strength decrease due to

osteoporosis, but also individuals become more prone to falling [109]. Figure 2.3

shows the incidence for women of vertebral, hip and wrist fractures as a function of

age.

The prevalence and incidence of vertebral fractures are difficult to measure, as ver-

tebral fractures can be asymptomatic [43]. Estimates of vertebral fractures must

be extrapolated from epidemiological studies. It appears that typically only severe

vertebral fractures actually result in back pain [47], as the prevalence of back pain

Figure 2.3: The variation in fracture incidence rate with age for women. Taken from [132]

actually declines after age 50, although the prevalence of vertebral fractures increases.

The detection of mild vertebral fractures is additionally complicated by the fact that

no reliable criteria exist to define them, and they are easily confused with other mild

deformities. There is also evidence that vertebral fractures on radiographs are of-

ten not reported [60, 42], or else not acted upon, partly due to the wide variety of

terminology used by radiologists.

In addition to causing pain and deformity to their sufferers, osteoporotic fractures

place a large burden upon national healthcare systems. Over 1.3 million osteoporotic

fractures occur annually in the United States [75]. In Europe in 2000 the number of

osteoporotic fractures was estimated to be 3.79 million, of which 0.89 million were

hip fractures. The financial cost of dealing with osteoporotic fractures in the United

States was estimated to be US$ 20 billion in 1988 and US$ 35 billion in 1998. In

Europe the financial cost was estimated to be 31 billion Euros in 2005; whilst in

England and Wales the cost was £542 million in 1999, and is currently in excess of

£1 billion in the UK. These figures will increase with the number of elderly in the

population in the coming years, providing further impetus for early detection and

effective treatment of osteoporosis. The earlier and more reliably the disease can be

detected, the more patients can benefit from strategies for prevention and treatment.

The scale of the disease and its increasing prevalence make its detection, prevention

and treatment important.

Knowledge of an individual’s lifestyle and medical history can help to detect patients

at high risk of osteoporosis [23, 85]. The most powerful risk factors for osteoporotic

fractures include having low premenopausal oestrogen due to stress, excessive exercise

or anorexia nervosa, and being thin. Dietary and lifestyle factors which increase

the likelihood of osteoporosis include the excessive intake of cigarettes, caffeine, and

alcohol, and a low intake of calcium and vitamin D. Some drugs are also known to

increase risk. An individual’s bone mineral density (BMD) has the most influence on

her/his risk of osteoporotic fracture [119, 137].

2.1.2 Prevention and Treatment

To some extent it is possible to prevent, or at least delay, the disease by avoiding

many of the known risk factors, such as excessive tobacco and alcohol, which have

other negative health consequences. Adequate calcium, vitamin C and D intake, and

regular moderate exercise are also important for maintaining high BMD.

Hormone replacement therapy (HRT) used to be given to women at menopause with

established low bone density. This has been shown to reduce bone loss immediately

after the menopause [48, 134]. Several cohort and case-control studies suggest

that HRT reduces fragility fracture risk by 30 to 50% [134], but that the effect

is lost within 5 years after discontinuation of HRT. However, HRT has side effects

(breast tenderness, uterine bleeding, increased risk of deep venous thromboembolism

and cardiovascular events), and its prolonged use increases the risk of breast can-

cer. Therefore HRT is no longer considered as a first line therapy for prevention

of postmenopausal osteoporosis, except for women who underwent the menopause

before the age of 45. Preliminary results from the Women's HOPE study indicate

that smaller doses of conjugated equine oestrogens (CEE) and medroxyprogesterone

acetate (MPA) are sufficient to slow down the bone turnover and to inhibit bone loss

in early postmenopausal women [89]. Long term evaluation of the side effects of this

regimen is not yet available.

Bisphosphonates (BP) are potent inhibitors of bone resorption through effects on

osteoclast resorption. They are used in a variety of metabolic bone diseases including

osteoporosis. BPs have a poor intestinal absorption, a long skeletal retention and can

induce mild gastrointestinal disturbances. The three bisphosphonates most frequently

used in the treatment of osteoporosis are etidronate, alendronate and risedronate.

There are other bisphosphonates under study e.g., ibandronate and zoledronate.

Alendronate (10 mg/day) was found to increase BMD, decrease levels of biochemical

markers of bone turnover, and decrease, by about 30-50%, the incidence of fragility

fractures [12]. The anti-fracture efficacy has been shown both in women with preva-

lent vertebral fractures and in women with low BMD (T-score ∗ < −2) but without

vertebral fractures [12, 10]. Risedronate (5 mg daily) decreases the incidence of new

vertebral and peripheral fractures by the same extent as alendronate in women with

prevalent vertebral fractures [74, 112]. In osteoporotic women 70 to 79 years of age,

risedronate decreased the incidence of hip fracture by 40% [96]. Histomorphometric

study in patients treated with risedronate for five years supports its excellent

long-term bone safety [129].

The first effective stimulator of bone formation, the recombinant 1-34 fragment of hu-

man parathyroid hormone [rhPTH(1-34)], has recently been approved. Teriparatide

is indicated for the treatment of osteoporosis in postmenopausal women who are at

high risk of a fracture. It also appears to increase bone mass in men with primary

or hypogonadal (low testosterone level) osteoporosis who are at high risk of fracture.

rhPTH(1-34) decreases the incidence of new vertebral fractures and nonvertebral frac-

tures by 65% and 53% respectively in osteoporotic women with prevalent fractures

[105].

Thus proven therapies exist for patients with vertebral fractures which reduce the

incidence of subsequent fractures by 30% to 65%. All patients with prevalent vertebral

fractures require treatment, as there is good evidence that the risk of further fractures

is extremely high, around 20% in the 12 months following a recent vertebral fracture.

Thus early detection of osteoporosis is important, and the early detection of prevalent

vertebral fracture is an important diagnostic feature.

With all new osteoporosis therapies, it is essential that safety and efficacy can be

evaluated rapidly and thoroughly in large multi-centre trials, in order to benefit

patients as soon as possible. The numbers of incident vertebral fractures that occur

for the trial group are used as a measure of efficacy. Furthermore osteoporosis can

be a side-effect of treatments for other conditions: for example chronic glucocorticoid

use [138]. There is also therefore a need for improvements in the detection of incident

fractures during large clinical trials.

∗The T-score BMD measure is explained shortly, when we discuss the measurement of BMD.

2.1.3 Interpretation of Indicators of Osteoporosis

When interpreting indicators of skeletal status, it is important to consider how all

the information available about a patient relates to the patient’s risk of fractures. In

particular age has a significant effect upon how BMD values are interpreted. An 80

year old woman’s BMD may be well below that of a healthy 50 year old, but her

BMD may be above average for her age. Her immediate risk of fracture is much

greater than the 50 year old, but the cumulative risk of her suffering a fracture in the

rest of her life may be less than that of the 50 year old. When considering whether

an individual requires treatment for osteoporosis, both immediate and longer term

risk of osteoporotic fracture are usually considered. The WHO recommends use of a

fracture risk assessment for the next 10 years, using a multi-factor prediction model

including (inter alia) BMD, age, height loss, exposure to systemic glucocorticoids,

parental fracture history, and current fracture status [82].

To detect osteoporosis, and to monitor the effects of treatments on the disease, one

is therefore interested in any measurement that can be performed which relates to

the current and future risk that a patient might suffer an osteoporotic fracture. Such

measurements include direct assessment of bone quantity in the skeleton, measure-

ments of bone turnover using biochemical markers, which help predict future bone

loss, and measurements of bone structure, which contributes to bone strength.

2.2 Measurement of Bone Mineral Density

2.2.1 Purpose

The amount of bone in an individual’s skeleton has been shown to be a very powerful

predictor of the risk of a fracture [78]. In 1994, an operational definition of osteoporo-

sis was proposed by the World Health Organisation (WHO) with diagnostic criteria

of fragility based on the measurement of bone mineral density (BMD) and on the

presence of fractures [81]. There are four categories:

1. Normal: BMD not more than 1 standard deviation below the young adult

mean.

2. Low bone mass (osteopenia): BMD between 1 and 2.5 standard deviations

below the young adult mean.

3. Osteoporosis: BMD more than 2.5 standard deviations below the young adult

mean.

4. Severe osteoporosis (established osteoporosis): BMD more than 2.5 standard

deviations below the young adult mean in the presence of one or more fragility

fractures.

The normalised score $(\mathrm{BMD} - \mu_R)/\sigma_R$ is referred to as the T-score, where $\mu_R$, $\sigma_R$ are the sample

mean and standard deviation in a reference population of young adults. Note

different reference values are used for men and women. This pragmatic definition

in terms of T-score clearly has limitations, as the cut-offs are somewhat arbitrary.

This definition was established for postmenopausal Caucasian women and may not

be applicable to men or women from other ethnic groups, who moreover may have

different population statistics for the “normal” mean and standard deviation. Fur-

thermore the variance of peak BMD depends on the measurement site, and so the

prevalence of osteoporosis (according to this diagnostic) depends on the measurement

site. We discuss in subsequent sections the need to also incorporate measures of bone

structure, strength, and fracture status. In particular a patient in the early stages of

osteoporosis could be in the osteopenia category, but have a number of mild vertebral

fractures, and should be diagnosed as in fact osteoporotic.
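The T-score and the four WHO categories listed above can be sketched in a few lines of Python. The reference mean and standard deviation below are hypothetical values chosen for illustration; in practice they depend on sex, measurement site and the reference population, and the function names are our own.

```python
def t_score(bmd, ref_mean, ref_sd):
    """T-score: BMD in standard deviations relative to the young adult reference mean."""
    return (bmd - ref_mean) / ref_sd


def who_category(t, fragility_fracture=False):
    """Map a T-score to the four WHO categories."""
    if t >= -1.0:
        return "normal"
    if t >= -2.5:
        return "osteopenia"
    return "severe osteoporosis" if fragility_fracture else "osteoporosis"


# Hypothetical reference values (g/cm^2), for illustration only.
REF_MEAN, REF_SD = 1.00, 0.10
for bmd in (0.95, 0.80, 0.70):
    t = t_score(bmd, REF_MEAN, REF_SD)
    print(bmd, round(t, 2), who_category(t))
```

Note that this BMD-only rule places a patient with a T-score of -2 and several mild vertebral fractures in the osteopenia category, which is exactly the limitation discussed in the paragraph above.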

Despite limitations in defining osteoporosis in terms of BMD scores, BMD is still

a powerful predictor of osteoporosis. There have been great advances over the last

decade in non-invasive techniques for very accurate measurement of bone mineral at

a range of skeletal sites. The method most commonly in use at present is dual energy

X-ray absorptiometry (DXA), which has superseded single and dual energy photon

absorptiometry (SPA and DPA). Ultrasound scanning, which is portable, and involves

no ionising radiation, is a promising technology for bone mineral measurement; how-

ever it is not sufficiently reliable at present for routine clinical use. Improvements in

its precision and accuracy may enable it to become a valuable measurement tool in

the future. We now discuss methods of measuring BMD.

2.2.2 Single Photon Absorptiometry (SPA)

Single photon absorptiometry [127] used rectilinear scanning with a beam of 27.3 keV

photons produced from an Iodine 125 source, and a collimated detector to measure

transmitted photons. It was the first method of direct bone mineral measurement

devised, and measured the rate of absorption of photons passing through bone. The

absorption of photons can be related to density and depth of bone through which

the beam has passed. Scanning was performed in a water bath, to correct the effect

of overlying tissue. Bone mineral was measured as ‘bone mineral content’ (BMC)

in grams, measuring the amount of bone in the path of the beam. By dividing this

measure by the projected area (Ap) in the path of the beam, one can also measure

bone mass per unit area (g/cm²), known as areal bone mineral density (BMD). BMD

is the most common measure of bone mass measured by densitometers.
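As a small worked example of the relationship between BMC, projected area and areal BMD, consider the sketch below. The numerical values are hypothetical and the function name is our own.

```python
def areal_bmd(bmc_g, projected_area_cm2):
    """Areal BMD (g/cm^2) = bone mineral content (g) / projected area (cm^2)."""
    if projected_area_cm2 <= 0:
        raise ValueError("projected area must be positive")
    return bmc_g / projected_area_cm2


# Hypothetical values: the same mineral content over two projected areas.
larger_area = areal_bmd(bmc_g=12.0, projected_area_cm2=12.0)
smaller_area = areal_bmd(bmc_g=12.0, projected_area_cm2=10.0)
print(larger_area, smaller_area)
```

Because the projected area is in the denominator, the same BMC spread over a smaller projected area yields a higher areal BMD, so areal BMD is sensitive to anything that changes the projected area of the bone.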

SPA was best suited to the measurement of bone at peripheral skeletal sites, such

as the forearm or heel. To scan more clinically relevant sites, with more overlying

fat and soft tissue, better correction is required. This can be achieved if scanning is

performed at two separate energies. SPA has now been superseded by dual energy

methods (first DPA then DXA, see below).

2.2.3 Dual Photon Absorptiometry (DPA)

Dual photon absorptiometry [140] (now obsolete) measured absorption of photons at

44 and 100 keV, produced by a Gadolinium 153 source. As tissue and bone have

different absorption coefficients at the two energies, the component of absorption

resulting from bone rather than tissue could be calculated. This enabled sites such as

the spine and femoral neck, which are surrounded by soft tissue, to have their bone

density reliably measured.
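The dual-energy principle can be sketched as a two-equation linear system. Assuming simple exponential (Beer-Lambert) attenuation, the log-attenuation at each energy is a weighted sum of the areal densities of bone mineral and soft tissue, and two energies give two equations in these two unknowns. The attenuation coefficients below are hypothetical placeholders, not tabulated values, and the function names are our own.

```python
import numpy as np

# Hypothetical mass attenuation coefficients (cm^2/g); rows are the low and
# high photon energies, columns are bone mineral and soft tissue.
MU = np.array([
    [0.60, 0.25],   # low energy
    [0.20, 0.17],   # high energy
])


def log_attenuation(bone, soft):
    """Forward model: ln(I0/I) at each energy for given areal densities (g/cm^2)."""
    return MU @ np.array([bone, soft])


def decompose(log_att):
    """Invert the two-energy measurements to (bone, soft tissue) areal densities."""
    return np.linalg.solve(MU, log_att)


# Simulate a measurement, then recover the two components from it.
measured = log_attenuation(bone=1.0, soft=20.0)
bone, soft = decompose(measured)
print(round(bone, 6), round(soft, 6))
```

The system is solvable only because the ratio of bone to soft tissue attenuation differs between the two energies; if the two rows of the matrix were proportional, the components could not be separated.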

By performing DPA in a scanning mode, it could be used as an imaging modality,

although it suffered from very low resolution and poor signal/noise, as photon flux

was low and calculation of bone content involves subtraction of images at separate en-

ergies. This method has in turn been succeeded by dual energy X-ray absorptiometry

(DXA).

2.2.4 Dual Energy X-ray Absorptiometry (DXA)

DXA operates in a similar fashion to DPA, except that a low output X-ray tube is used

instead of a radionuclide source. The use of X-rays enables significantly higher photon

flux to be achieved, resulting in lower noise and improved image quality, shorter scan

times (5 minutes per site), and improved precision of BMD measurements (1-2%).

DXA is capable of measurement of BMD at a large range of skeletal sites, including

the arms, legs, spine, hip and pelvis. Whole body BMD can also be measured. Like

the earlier SPA and DPA, DXA measures “areal” BMD (in g/cm²), which depends

on both volumetric BMD and bone dimensions; but as fracture risk depends on both

bone mineralisation and bone size, areal BMD is a good predictor of fracture risk.

The accuracy and reproducibility of DXA are better than those of other densitometric

methods [63]. There is a good correlation between BMD at different measurement

sites, but the best predictor of the risk of fracture at a site is the BMD measured at

that site [36]. Sites for BMD measurement in clinical practice are the lumbar spine

(L1-L4) and the hip (femoral neck and total hip).

Initially DXA used a pencil beam and a single detector moved in a raster across

the site of measurement. Modern DXA scanners employ fan-beam X-ray sources

and a bank of detectors to image a whole line array simultaneously. This allows

faster scanning (approximately 3-5 minutes per site) than rectilinear (‘pencil’ beam)

systems, and has improved image quality and spatial resolution. DXA is currently

one of the most effective and reliable methods of measuring bone density, and its use

is increasing. There are around 27,000 central DXA scanners worldwide. Its only

major disadvantage is the use of ionising radiation, albeit at a low dose (only 1-6µSv

for BMD measurement, 7µSv for single-energy mode spine imaging, and 42µSv for

dual energy spine imaging, compared to 500µSv for conventional radiography [14]).

Imaging artefacts can cause inaccuracies in DXA areal BMD measurements [1], most

commonly in the lumbar spine. Degenerative disc disease with osteophytes, or osteoarthritis

with hyperostosis of the facet joints can falsely elevate BMD; laminectomy† would falsely

reduce the BMD of the affected vertebra; vertebral fracture can also

falsely elevate BMD (same BMC as before fracture, but Ap is reduced). A similar

effect of reduced projected area can cause overestimation of BMD at the hip (due to

patient positioning), if there is inadequate internal rotation of the femur (resulting

in foreshortening of the femoral neck and reduction of Ap). As DXA uses the soft

tissues as a reference, errors in BMD can also arise if the patient is excessively under-

or overweight.

†the removal of the laminae and spinous process

As the image quality of DXA scans improved, so measurements based upon image

structure rather than intensity (BMD) became feasible. The measurement of verte-

bral shape and other bone dimensions can now be performed from DXA images with

reasonable accuracy, as will be further discussed in Section 3.2.2.

A review of DXA technology and clinical use is given in [1].

2.2.5 Quantitative Computed Tomography (QCT)

In Quantitative Computed Tomography (QCT) [62] a radiation source produces X-

rays that pass through the patient to a detector on the opposite side. The source and

detector rotate about the imaged volume, and the attenuated X-rays are obtained

as a set of 2-D projections. Mathematical reconstruction algorithms are then used to

reproduce the 3-D representation of the spatial variation in attenuation within the

imaged volume. Calibration phantoms, made of different concentrations of calcium

hydroxyapatite in water-equivalent plastic, are used to convert attenuation to true

volumetric bone mineral density. QCT is the only method regularly in use which

enables volumetric bone mineral density (g/cm³) to be measured, but at the cost

of increased radiation dose. Precision of QCT-measured ‘true’ BMD is excellent (c.

1% CoV). Single energy QCT systems for measurement of bone mass have been in

use since the early 1980s. The 3-D nature of the technique means that QCT allows

examination of the separate contributions of cortical and trabecular bone. Since the

trabecular bone is normally weakened first in osteoporosis, this makes it a sensitive

technique for detecting vertebral bone loss [72], but the method requires careful

calibration using a reference phantom. The most common application of QCT to bone

densitometry is the direct measurement of trabecular bone in the lower vertebrae of

the spine (L1-L3) using general purpose CT scanners. Specialised scanners have also

been developed to measure BMD in peripheral skeletal sites. These obtain 1.2mm

sections through the region of interest, in an effort to reduce dose. Thus QCT is an

established and useful tool in the measurement of site-specific bone density. The fat

content of trabecular bone means that dual-energy QCT can also be used to further

improve accuracy, but at the cost of higher dose and poorer precision.
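The phantom calibration step described above can be sketched as a straight-line fit between the CT numbers measured in the phantom rods and their known hydroxyapatite concentrations. The linear relationship, the rod concentrations and the measured CT numbers are all assumptions made for illustration, and the function name is our own.

```python
import numpy as np

# Hypothetical phantom rods: known hydroxyapatite concentration (mg/cm^3)
# and the mean CT number measured inside each rod.
known_density = np.array([0.0, 50.0, 100.0, 200.0])
measured_ct = np.array([2.0, 68.0, 131.0, 262.0])

# Least-squares straight line: density = slope * ct_number + intercept.
slope, intercept = np.polyfit(measured_ct, known_density, deg=1)


def ct_to_bmd(ct_number):
    """Convert a CT number to volumetric BMD using the phantom calibration."""
    return slope * ct_number + intercept


# Apply the calibration to a hypothetical trabecular region of interest.
print(round(float(ct_to_bmd(150.0)), 1))
```

Fitting the line from rods scanned alongside the patient means that scanner drift affects the phantom and the bone alike, which is why the calibration is repeated rather than fixed once.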

2.2.6 Quantitative Ultrasonography (QUS)

In quantitative ultrasonography (QUS) bone density is measured using two param-

eters of ultrasound transmission: speed of sound (SOS) and broadband ultrasound

attenuation (BUA) [44]. Most equipment measures these parameters at the calcaneus,

phalanges of the fingers, tibia and patella. Most scientific data on QUS fracture

prediction has been obtained at the calcaneus. Correlation between QUS and DXA

BMD is modest [44], and the predictive power of QUS for osteoporotic fracture is

slightly lower than BMD. Despite the limited range of sites at which this technique

can be applied, its low cost, portability, and lack of ionising radiation mean that it

may become a practical alternative to X-ray based methods for routine screening,

although the technique is used primarily as a research tool at present. QUS has yet

to be widely used in clinical practice [44], possibly because its long term precision

is rather low. It is temperature-dependent, and there are no reliable phantoms for

cross-calibration between scanners.

2.3 Measurement of Bone Structure and Integrity

Bone density is not the only determinant of bone strength. BMD alone is insufficient

to determine bone strength [59], and there is a considerable overlap of BMD for

patients with and without fragility fractures [6]. Bone shape and structure also

affect its strength, and hence the likelihood of fracture. Bone shape can affect the

chances of a future fracture in one of two ways. Firstly, a shape change may have

resulted from an osteoporotic fracture itself (such as in the spine), indicating that

damage has already taken place. Secondly, a bone’s shape may affect the stress it

experiences under normal loading. For example, the natural variation in shape of

the femoral neck means that some individuals are at greater risk of hip fracture than

others, purely as a result of their femoral neck shape. Those with greater hip axis

length on DXA are at greater risk of hip fracture [51, 52].

Measures of the micro-structure of the trabeculae can also help in assessing the bone

strength. Various techniques have been established for describing changes in bone

structure and shape at a range of skeletal sites, and many of these have been shown

to be useful in improving fracture prediction [106, 67]. For example Smyth et al [125]

developed a method of characterising the texture of the proximal femur which had

a moderate correlation with expert radiologist evaluation of the Singh index [124].

Gregory et al [70, 69] have developed a combined predictor of hip fracture based

on BMD, shape descriptors, and texture descriptors of the bone micro-architecture.

Other trabecular bone structure descriptors of the distal radius, the calcaneus, and

the spine have been reported to improve fracture risk evaluation when combined with

BMD [90]. Recent developments in QCT technology (peripheral pQCT scanners) allow

high resolution imaging (82µm voxels) of the distal radius and allow individual trabeculae

to be visualised. This allows a variety of histomorphometric measures: trabecular

bone volume fraction, trabecular thickness, spacing and number. It has been found

that pQCT-derived trabecular density at the distal radius is significantly different

in osteopenic women with a history of fracture than in those without [16], whereas

DXA-derived BMD was not able to distinguish between these fracture/non-fracture

groups.

Magnetic Resonance Imaging (MRI) (see also section 3.2.3) can be used to derive

bone structure estimates. Bone tissue has a very low water content, and additionally

protons within bone tissue matrix have a short T2 relaxation time - an MRI measure

reflecting the chemical environment of the protons. As a result bone gives no signal

in standard MRI. In high-resolution T2 weighted MRI images, bone tissue appears

black while bone marrow (because of its water and fat content) produces a high

intensity signal within the inter-trabecular spaces. Thus the trabecular network can

be visualised indirectly through marrow visualisation. MRI derived T2* was found

to have a predictive value in differentiating between healthy women and osteoporotic

women with mild fractures [37]. High resolution MRI bone imaging is most commonly

performed at peripheral sites (heel and wrist), but recent developments have allowed

high-resolution imaging of the proximal femur [86]. Morphology parameters and

fractal analysis from high-resolution MRI images can be used to detect differences in

trabecular structures with age, BMD and osteoporotic status [92, 91].

The use of combined structure/BMD measurements is likely to become increasingly

widespread as technological improvements begin to allow the clinical community to

move towards multiple factor risk assessments. A summary review is given in [84].

But for now these methods are still more in the research stage than in clinical use.

2.4 Vertebral Fractures

As vertebral fractures can occur relatively early in the progression of osteoporosis,

their presence is a very important diagnostic. The presence of a (non-traumatic)

vertebral fracture is itself an indicator of a loss of bone strength. In trials of osteo-

porosis treatments or prevention regimes in peri- and postmenopausal women, the

rate of vertebral fracture is used as a quantitative measure of the effectiveness of

the method under trial. The diagnosis of vertebral fracture is however by no means

straightforward, and is considered in more detail in the next chapter.

Chapter 3

Vertebral Fracture

3.1 Vertebrae and Vertebral Fractures

Vertebrae are box-like structures which form the spine, supporting the body’s weight

whilst allowing flexibility. The basic (lateral) anatomy of a vertebra is shown in

Figure 3.1.

The spine is composed of several sections (Figure 3.2). In this work we will be

concerned with vertebral fractures occurring in the thoracic and lumbar spine. In

the chest region the thoracic spine attaches to the ribs. There are 12 vertebrae in

the thoracic spine, denoted T1 (uppermost) to T12. In the lumbar spine there are

normally five vertebrae, denoted L1 (uppermost) to L5.

A vertebral fracture commonly occurs when the inner structure of the vertebral body

(made from cancellous trabecular bone) has been weakened by osteoporosis, and the

Figure 3.1: The lateral anatomy of a vertebra. Labelled structures: spinous process, vertebral body, superior and inferior vertebral notches, superior and inferior articular facets, cortical bone, and trabecular bone.


Figure 3.2: The spinal column, showing the numbered cervical, thoracic, and lumbar vertebrae.



vertebra breaks under only very mild trauma. Typically the central endplate collapses

(or is pushed by the expanding nucleus pulposus) into the vertebral body. Most mild

fractures take the form of a concave central depression of the endplate. Often the

cortical rim of the vertebra remains intact, at least in the early (mild) stages of the

fracture. In lateral view this gives rise to a complex edge appearance, where the

vertebral ring appears as an exterior edge, and the depressed endplate as an inner

edge with a rather diffuse appearance. Fractures may then progress to become wedge-

like (anterior height reduction) if there is fracture of the vertebral ring and vertebral

body cortex. Finally the posterior cortex may also collapse giving rise to a crush

fracture where there is wholesale loss of height. Vertebral fracture is a continuous

process, rather than a simple dichotomy (fractured or not), and in its early stages it

can be difficult to diagnose.

Figure 3.3 shows the typical appearance of normal and fractured vertebrae imaged

using conventional radiography, together with rather ambiguous cases which could be

early mild fractures, but which could also be some other form of mild deformity. Fig-

ure 3.4 shows a radiograph of a spine displaying symptoms of advanced osteoporosis

with numerous severe fractures.

3.1.1 Significance

Low energy vertebral fractures can be a source of back pain, although they can often

be asymptomatic [47], and hence undiagnosed. The presence of even one vertebral

fracture increases the risk of any subsequent vertebral fracture five-fold [100], and

the risk of a subsequent hip fracture is doubled [9]. Vertebral fractures also result in

height loss, kyphosis and morbidity.

Pharmaceutical trials commonly use the presence of vertebral fractures as an entry

criterion for diagnosis of established disease. Although vertebral fracture incidence is

a less powerful and precise indicator of osteoporosis than bone mass, the fact that it

more directly indicates bone strength means that it is a more trusted validation for

treatments. In a report from the 2003 MORE study, the presence of severe vertebral

fracture was the strongest predictor of future vertebral and non-vertebral fractures

[41].



Figure 3.3: The left hand image is a good quality spinal radiograph showing generally healthy vertebrae. However both T9 and T7 (2nd and 4th from bottom, as labelled) may be early mild fractures, though they could also be slightly deformed for some other reason. The right hand image shows an image with a moderate vertebral fracture.



Figure 3.4: This radiograph shows an osteoporotic spine with numerous severe fractures.



3.2 Imaging the Spine

3.2.1 Conventional Radiography

The conventional method of imaging the spine in order to detect vertebral fractures

is to perform lateral X-ray radiography. It is not possible to image the complete

spine on a single X-ray film, so normally both a lumbar and a thoracic radiograph

are taken to cover all vertebrae from L4 to T4.

Lateral radiography causes a projection effect due to the divergent beam, resulting in

enlargement and distortion of the vertebrae furthest from the centering point when

imaged. The effect of this is shown in Figure 3.5, in which vertebrae above or below

the X-ray centre point are magnified and distorted, as a result of the X-ray beam no

longer passing laterally through the vertebra, but at an angle from above or below.
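The geometry behind this effect can be illustrated with a small calculation. The sketch below is a simplified model only, assuming a point source, a flat film, and an endplate idealised as a flat horizontal disc; the function names and the distances in the usage lines are illustrative assumptions, not values from any particular radiographic protocol.

```python
import math

def ray_obliquity_deg(offset_mm, source_to_spine_mm):
    """Angle (degrees) between the central ray and the ray through a
    vertebra lying offset_mm above or below the centring point."""
    return math.degrees(math.atan2(offset_mm, source_to_spine_mm))

def magnification(source_to_film_mm, source_to_spine_mm):
    """Geometric magnification of an object in a divergent beam."""
    return source_to_film_mm / source_to_spine_mm

def apparent_endplate_opening_mm(endplate_width_mm, offset_mm, source_to_spine_mm):
    """Apparent vertical opening of an endplate idealised as a flat disc.

    Viewed exactly edge-on the disc projects to a line. Viewed at an
    obliquity theta its rim projects to an ellipse whose minor axis is
    width * sin(theta) (magnification is ignored here).
    """
    theta = math.atan2(offset_mm, source_to_spine_mm)
    return endplate_width_mm * math.sin(theta)

# Hypothetical distances: a vertebra 150 mm from the centring point,
# with the source 800 mm from the spine and a 40 mm wide endplate.
angle = ray_obliquity_deg(150.0, 800.0)
opening = apparent_endplate_opening_mm(40.0, 150.0, 800.0)
```

Under these assumed distances the rim of a perfectly flat endplate already opens into an ellipse several millimetres high, which is comparable in size to the depression seen in a mild endplate fracture.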

If the patient is not correctly aligned (spine parallax), or perhaps has a scoliosis,

then there may be apparent tilting of the vertebral bodies. This is sometimes called

the “bean-can” effect, and leads to the vertebrae rims being distorted into quasi-

elliptical shapes. In extreme cases adjacent vertebral bodies may even appear to

inter-penetrate. Such radiographs are very difficult to interpret, and sometimes false

positive diagnoses of vertebral fracture may result from the distorted perspective,

which tends to give the endplate rim the concavely depressed appearance typical of

endplate fracture. Figure 3.6 shows a radiograph with a moderate tilting effect, and

a more seriously malaligned radiograph.

Although spinal radiographs give high resolution, they expose patients to a high

radiation dose, and the need to use two overlapping radiographs to visualise the

whole spine makes it difficult to identify the vertebral levels reliably.

3.2.2 Imaging with DXA and SXA

The use of lateral spine images, obtained with fan-beam X-ray bone densitometry

systems, offers a potential practical alternative to radiographs for clinical analysis

of vertebral fractures. DXA (and SXA) scanners, described in Section 2.2.4, use a

parallel beam geometry to eliminate the projection effect of conventional radiography.

As DXA scanners have come to be used for measurement of bone dimensions to assess



Figure 3.5: The projection effect of lateral radiography on the spine. The diagram (patient facing away from page) shows the X-ray source, the vertebrae in the spine, and the film, on which the image of the vertebrae is distorted due to projection.

osteoporosis, so the criteria for choosing the scanning method have changed. Such

morphometric measurements can also be performed using DXA scanners operating in

a single energy (SXA) mode, which can give a quicker scan time with some equipment.

However SXA imaging tends to give a noisy picture of the thoracic spine, due to soft

tissue associated with the lungs and ribs, and diaphragm motion artefacts. A review

of DXA technology and clinical use is given in [1].

Figure 3.7 shows vertebrae imaged using Dual Energy X-ray Absorptiometry (DXA),

showing both a good quality image of a healthy spine, and a severely osteoporotic

spine in which almost every vertebra is fractured. Figure 3.8 shows an image with

several vertebral fractures, together with a zoomed-in view of the fractured T7-T9

vertebrae. The horizontal lines appearing in the middle of some of these images are

diaphragm motion artefacts due to breathing during the exposure. DXA involves a

substantially lower radiation dose than conventional spinal radiographs [14], avoids

the projectional parallax effects of conventional radiographs due to the divergent

X-ray beam, and has the whole spine on a single image. DXA vertebral fracture

assessment can be combined with a bone mineral density (BMD) assessment using

the same equipment. If the scanner has a C-arm then the lateral spine images can

be obtained in the supine position without repositioning the patient into the lateral



Figure 3.6: Apparent tilting and parallax effects on lumbar radiographs (image labels: L4, T12, L4, L1). The left hand image shows modest beam misalignment in the upper part of the radiograph (T12/L1). The right hand image shows more serious parallax effects.



decubitus position ∗. Although the best resolutions available from DXA and SXA are

still significantly worse than that available using conventional radiography (0.35mm

vs 0.01mm for radiography), good agreement is obtained between morphometric mea-

surements on DXA and radiographs [130], and between expert reading of DXA images

and radiographs for moderate and severe fractures [110, 8, 56, 58, 20]. There are,

however, discrepancies over mild fractures, and there can be problems in visualising

the upper thoracic vertebrae on DXA (above T7), though there tend to be fewer ver-

tebral fractures in this region. Thus visual vertebral fracture evaluation, using lateral

spine images obtained with a DXA device, in the absence of, or as a pre-screen for

conventional radiography, can significantly improve osteoporosis risk evaluation. The

combined evaluation of vertebral fracture status and BMD could become the stan-

dard for patient evaluation, particularly in older postmenopausal women for whom

vertebral fractures are common but may be asymptomatic. A proportion of cases

where differential diagnosis is not possible on DXA, or where patient obesity causes

a particularly poor signal to noise ratio, would need referring for conventional spinal

radiography.

3.2.3 Magnetic Resonance Imaging (MRI)

Magnetic Resonance Imaging (MRI) is a non-ionizing modality that uses the inter-

action of a strong magnetic field with the spin alignment of hydrogen nuclei

(i.e. protons), typically in water, to produce an image. Sequences of radiofrequency

pulses are used to produce 3-dimensional images. The interactions between protons,

gradient fields, and radiofrequency pulses allow an image to be created based on

spatial frequency encoding. Hydrogen, present in musculoskeletal water, is the most

frequently studied component by MRI. Bone tissue has a very low water content and

as a result bone gives a low signal value in standard MRI. MRI imaging can therefore

be used for vertebral morphometry. Normally the T1-weighted sagittal slice is used.

Figure 3.9 shows the typical appearance of vertebrae on an MRI image.

Goh et al [66] evaluated the precision of MRI-based vertebral morphometry on a

dataset of 220 mid sagittal T1-weighted MRI images. They found a high degree of

precision (c. 2% CoV) for vertebral body heights. They concluded that MRI-based

morphometric analysis is likely to be superior to using radiographs, given the selective

planar nature of MRI imaging. Unlike radiological studies, MRI investigations of the

∗i.e. lying on one side with the knees tucked up in a somewhat foetal position



Figure 3.7: The left hand image is a good quality DXA image showing a healthy spine with points marked for 6-point vertebral morphometry. The right hand DXA image shows a severely osteoporotic spine, with almost every vertebra fractured, and poorer signal to noise ratio probably due to low BMD or patient obesity.



Figure 3.8: The left hand DXA image shows a spine with multiple fractures in both thoracic and lumbar spines, whilst a zoomed-in view of the fractured T7-T9 is shown on the right.



Figure 3.9: Appearance of vertebrae on a T1-weighted sagittal slice MRI image of the thoracic spine (T11-T9).

thoracic spine enable visualisation of the upper thoracic segments (above T4). More

importantly, MRI vertebral morphometry data may be examined and interpreted in

relation to other spinal pathologies or clinical findings, such as malignancy, infection,

and disc degeneration.

A similar study by Tomomitsu et al [133] comparing vertebral height measurements

with T1-weighted sagittal MRI to radiographs for the lumbar spine concluded that

mild biconcave fractures could be better detected by MRI than by radiographs. How-

ever the conclusion was a non-sequitur, as the morphometric criteria used for fracture

diagnosis were based on thresholds defined for radiographs, and as discussed in the

next section are notoriously non-specific in any case. The study showed that MRI

tended to produce values of anterior and posterior height that were greater than those

obtained from radiographs, whereas the value of central vertebral height tended to

be lower in the MRI images (which explains why more morphometric “fractures”

were diagnosed on MRI!). Nevertheless the ability of the modality to allow accurate

height information was established. Although the cost of the technique has reduced

in recent years, access to MRI scanners still tends to be restricted to higher priority

uses, and so the use of MRI for vertebral fracture evaluation is not yet established in

clinical practice.



3.2.4 Sagittal Computed Tomography

Bauer et al [5] investigated the use of sagittal reformations of axial computed tomog-

raphy datasets in the identification of vertebral fractures. The conclusion was that

sagittal CT reformations could more accurately assess vertebral fractures than stan-

dard radiographs, as long as the thinner available slice thicknesses were used (<3mm).

A recent study by Williams et al [139] also concluded that vertebral fractures could

be seen in sagittal slice reformations from 3D volume QCT images, although the

sensitivity of axial CT images was found to be poor. Although the original 3D axial

images may have been acquired for other purposes, it was recommended that the

sagittal reformation be examined by the consulting radiologists for signs of vertebral

fracture.

3.3 Vertebral Fracture Identification

The diagnosis of vertebral fracture can be challenging and somewhat controversial,

and there is no absolute “gold standard” for the definition of vertebral fracture. A

review of the various diagnosis methods is given by Guermazi et al [71]. Established

methods of vertebral fracture definition focus mainly on the identification of short

vertebral height as the indicator of a vertebral fracture. This can be problematical,

especially in the case of mild fractures for two reasons. The change in height cannot

be determined longitudinally (from a single image), and short vertebral height is not

specific to osteoporotic fracture. Short vertebral height may be a long-standing devel-

opmental abnormality, or due to for example Scheuermann’s disease or degenerative

disc disease; or normal vertebrae may exhibit misleading appearances on radiographs

due to parallax when the vertebral bodies are projected obliquely. Figure 3.10 shows

potentially confusing radiographs, one displaying reduced height due to Schmorl’s

nodes, and the other apparent wedging due to remodelling in degenerative disease.

Existing quantitative morphometric methods (discussed further below) based on pos-

terior, anterior and middle heights are particularly prone to false positives on non-

fracture short vertebral height deformity. Therefore the Genant semi-quantitative

method [65] (and see next paragraph) has become almost a de facto gold standard

for fracture assessment, but there still remains significant subjectivity, particularly

for mild (grade 1) fractures. The subjectivity problem is discussed at length by Jiang

et al and by Ferrar et al [79, 54], who propose an Algorithmically Based Qualitative



Figure 3.10: Examples of non-fracture vertebral deformities. The left hand image shows a number of vertebrae with Schmorl’s nodes, and the right hand image shows the wedge-like appearance caused by remodelling due to degenerative disease.

diagnosis method (ABQ). In addition to the issue of subjectivity, another problem

is the inadequate number of radiologists in some countries to interpret radiographs

[7]. Furthermore, newer scanning technologies (DXA) are becoming available in units

other than radiology departments. Therefore it is desirable to define a quantitative

approach which can capture at least some of the more subtle information used in

expert visual assessment.

3.3.1 Semi-quantitative identification of vertebral fracture

Identification of fracture in the Genant semi-quantitative (SQ) method [65] is based

on both the appearance of apparent reduction in vertebral body height, and also the

identification of radiological characteristics of fracture at the vertebral endplate. The

evaluation of the change in height is standardised so that a fracture is identified if

vertebral height appears reduced by more than 20%, and fractures are graded into 3

grades (mild, moderate, severe) corresponding to height loss of 20-25%, 25-40%, or



Figure 3.11: The Genant semi-quantitative grading system

more than 40% respectively. Fractures can also be assigned a type (endplate, wedge

or crush) as illustrated in Figure 3.11. However height reduction alone is not intended

to be the sole criterion, and expert assessment of the radiological characteristics of

fracture should also be used: changes at the endplate and cortical margin, lack of

consistency with adjacent vertebrae, and lack of parallelism of the endplates. There

is a need for greater clarification and consistency of these radiological characteristics

of fracture, compared to other diagnoses. Otherwise there is a danger of too much

subjectivity. Nevertheless studies have shown good inter-radiologist concordance in

applying SQ, when the radiologists doing the assessment had been explicitly trained

in the technique [87, 141, 11]. On the other hand the proportion of mild vertebral

fractures identified by SQ tends to be greater than for other methods. For example in

the Study of Osteoporotic Fractures (SOF) almost four times as many mild vertebral

deformities were identified by SQ as by various quantitative approaches [142], and

in another study [11] mild vertebral fractures identified by SQ in men were found not

to be correlated with BMD measurements.
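The height-loss part of the SQ grading can be written down directly from the thresholds quoted above. The following is a minimal sketch of that part only: the function name and the handling of values falling exactly on a boundary are our own assumptions, and the qualitative assessment of endplate and cortical changes, which SQ also requires, is not represented.

```python
def sq_grade(reference_height, observed_height):
    """Map apparent fractional height loss to an SQ-style grade.

    Returns (grade, label) using the thresholds quoted in the text:
    20-25% mild, 25-40% moderate, more than 40% severe.
    """
    if reference_height <= 0:
        raise ValueError("reference_height must be positive")
    loss = 1.0 - observed_height / reference_height
    if loss > 0.40:
        return 3, "severe"
    if loss > 0.25:
        return 2, "moderate"
    if loss > 0.20:
        return 1, "mild"
    return 0, "normal"
```

A mild fracture occupies only a five percentage point band of height loss, so a small error in the estimated reference height is enough to move a vertebra between grade 0 and grade 1.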

Jiang et al and Ferrar et al [79, 54] have proposed an Algorithmically Based Qualita-



tive diagnosis method (ABQ), in order to address the rather subjective assessment of

radiological characteristics of fracture in SQ. In ABQ more stress is placed on the con-

cavely depressed appearance of the collapsed endplate as being the fundamental way

of distinguishing an osteoporotic fracture from other forms of short height deformity.

As already discussed above this tends to produce a more diffuse or multiple-edged

appearance to the vertebra, whereas for example in wedging due to degenerative

disease the edges of the vertebral bodies remain relatively crisp. ABQ has a rec-

ommended flowchart of various differential diagnoses. In a recent study [55] it was

found that women with ABQ-identified vertebral fractures had significantly lower

BMD (and age-adjusted BMD) than those women who had been diagnosed with ver-

tebral fracture by other methods, but were regarded as short vertebral height (SVH)

deformities (not fractures) by ABQ. Similarly the ABQ-positives had better correla-

tion with other osteoporosis surrogates (history of non-vertebral fracture, low weight

and self-reported height loss).

3.3.2 Quantitative Morphometry

Melton [21] developed definitions of vertebral fractures utilising percentage reduc-

tions in ratios of anterior, middle or posterior heights of vertebral bodies compared

with normal values for that particular vertebral body. Eastell [45] modified this

method, defining fractures on the basis of standard deviation reductions instead of

fixed percentages. These were standard deviations derived from an (outlier-trimmed)

normal population. A 3 SD threshold was proposed. The Eastell/Melton criteria

define endplate fractures when the mid-height to posterior height ratio is below a

threshold, and wedge fractures when the anterior to posterior height ratio is below

a threshold. Crush fractures are defined by taking the ratio of the posterior height

to that of both neighbours and then thresholding both of these. One problem in the

crush criteria is that by using two ratios it doubles the chances of a false positive.

There can also be problems in identifying crush fractures if the neighbouring verte-

brae are fractured. Also false positive endplate or wedge fractures can be identified

because the posterior height is unusually large. These considerations led McCloskey

[95] to propose a number of modifications to the Eastell/Melton standard criteria,

including the use of predicted posterior heights and the addition of more complex

criteria in order to reduce false positives.
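The Eastell/Melton style of criterion described above can be sketched as a set of ratio tests against reference statistics. This is an illustration under stated assumptions, not the published procedure: the function name, the dictionary keys, and the layout of the reference data are hypothetical, and we read the remark about doubled false positives as meaning that either neighbour ratio falling below threshold flags a crush fracture.

```python
def eastell_classify(ha, hm, hp, hp_above, hp_below, ref, n_sd=3.0):
    """Classify one vertebra from its anterior, middle and posterior heights.

    ref maps a ratio name to a (mean, sd) pair for this vertebral level,
    derived from a normal population (hypothetical data layout).
    A ratio is abnormal if it lies more than n_sd SDs below the mean.
    """
    def below(name, value):
        mean, sd = ref[name]
        return value < mean - n_sd * sd

    found = []
    if below("mid/post", hm / hp):
        found.append("endplate")
    if below("ant/post", ha / hp):
        found.append("wedge")
    # Assumption: either neighbour ratio below threshold flags a crush,
    # which is what gives two chances of a false positive.
    if below("post/above", hp / hp_above) or below("post/below", hp / hp_below):
        found.append("crush")
    return found
```

Every test except the crush test divides by the vertebra's own posterior height, so an unusually large posterior height lowers both ratios and can trigger a false positive, as noted above.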

In the McCloskey method the expected posterior height is predicted from up to four of



its neighbours in a way which eliminates the use of fractured neighbouring vertebrae

as predictors. The algorithm proceeds up the spine † and predicts a vertebra’s pos-

terior height from up to four of its nearest neighbours. Each prediction assumes that

vertebral heights scale in the same ratio as the ratios of their mean heights, so some

mean reference data are required (typically obtained from a trimmed population of

healthy subjects). This gives up to four predictors, but some of the four neighbours

may be excluded from the prediction step. Firstly any vertebra below the currently

considered one that has already been classified as a crush fracture is excluded. Sec-

ondly the maximum of the predicted heights from the remaining neighbours is taken

as an initial base against which to compare the other remaining predictors. If the ra-

tio of any of the remaining predictors to this maximum is less than the crush fracture

threshold, then it too is excluded. The final prediction is the mean taken over the

remaining set of predictors. This gives a baseline posterior height against which to

assess crush ratios. Furthermore a vertebra is only considered to have an endplate or

wedge fracture if the ratio of the mid-height or anterior height to both the posterior

and predicted posterior heights is below threshold. These additional criteria tend to

reduce false positives caused by large posterior height.
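The prediction step described above can be sketched as follows. This is our reading of the description rather than McCloskey's published code: we assume the four neighbours are the two vertebrae on either side, that levels are indexed from the most caudal upwards, and that the crush threshold is supplied as a simple ratio; the function name and the default threshold value are hypothetical.

```python
def predict_posterior_height(i, hp, mean_hp, crushed, crush_ratio=0.8):
    """Predict the posterior height of vertebra i from its neighbours.

    hp       measured posterior heights, most caudal vertebra first
    mean_hp  reference mean posterior heights, same ordering
    crushed  set of indices already classified as crush fractures
    Returns None if no usable neighbour remains.
    """
    neighbours = [j for j in (i - 2, i - 1, i + 1, i + 2) if 0 <= j < len(hp)]
    # Step 1: exclude lower neighbours already classified as crushed.
    neighbours = [j for j in neighbours if not (j < i and j in crushed)]
    if not neighbours:
        return None
    # Each neighbour predicts via the ratio of the reference mean heights.
    preds = [hp[j] * mean_hp[i] / mean_hp[j] for j in neighbours]
    # Step 2: exclude predictors that look crushed relative to the largest.
    top = max(preds)
    kept = [p for p in preds if p / top >= crush_ratio]
    # Final prediction: mean over the remaining predictors.
    return sum(kept) / len(kept)
```

Comparing each predictor with the maximum lets a crushed vertebra above the current one be rejected even though it has not yet been classified.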

Minne [103] has developed an approach that assesses the presence and severity of

vertebral fractures by comparing vertebral heights that have been normalised for

body size by dividing all values by the corresponding values of T4. These results are

then compared to a normal range based on the values for healthy young women.

The main weakness of QM is its lack of specificity, because QM treats all height re-

ductions as indicative of vertebral fracture. Poor differentiation between true fracture

and non-fracture deformity by QM mainly affects mild deformities [141, 11, 68]. Wu

et al. [141] compared SQ and a morphometric approach using different fracture crite-

ria for the detection of incident fracture. A consensus reading by three observers using

SQ was used as the gold standard. There was only fair to moderate agreement be-

tween quantitative morphometry and visual interpretation (the highest kappa score

was 0.63). The authors concluded that the assessment of incident fractures with

morphometric techniques may not be sufficiently reliable, and that visual reading

by a trained observer should also be performed. In a comprehensive study, Black

et al. [11] compared four different morphometric techniques in 3,013 spine films.

In addition, SQ was compared with the morphometric approach in 502 cases. The

agreement between the semi-quantitative approach and the quantitative approaches

†most caudal vertebra first



was moderate. There was a high concordance between quantitative morphometry

and the semi-quantitative evaluation for fractures defined as moderate or severe by

semi-quantitative reading. There was, however, a significant discordance for fractures

designated mild in the semi-quantitative reading. In a recent study by Ferrar et al

[55] into the ABQ method with a fracture-enriched population, it was found that

those women who were diagnosed with osteoporotic fracture by ABQ correlated well

with other measures of osteoporosis (BMD age-adjusted z-score, weight and height

loss); whereas those ABQ-negatives with QM fractures (often mild wedges) were not

significantly different to the normal population with regard to these measures. The

conclusion was that morphometric diagnosis alone is insufficiently specific.

3.4 Conclusion

There is a significant degree of under-reporting of vertebral fractures, but problems

remain in the differential diagnosis of osteoporotic fracture. There is a large degree

of subjectivity in the application of the most widely accepted semi-quantitative tech-

nique, especially in distinguishing mild fractures from other forms of SVH deformities.

The newer ABQ method shows good evidence of better distinguishing between frac-

tures and other SVH deformities. Nevertheless, as there is both a world-wide shortage

of radiologists [7] to perform expert reading of difficult cases, and newer equipment

such as DXA scanners are now becoming available outside of radiology units (e.g. in

general practice), or are being used in pre-screens by less-skilled operators (e.g. spe-

cially trained radiographers), there is also a need to advance quantitative techniques

of fracture diagnosis. The first requirement is to (semi)-automatically determine the

full vertebral shape, and secondly take account of more subtle shape information

(concave depression), and textural features in some way, in order to use at least some

of the visual clues used in expert reading. Although some difficult cases will always

require differential diagnosis by an expert radiologist, there is also a role for better

pre-screening of the cases that need such referral.

Later in this thesis we develop methods of automatically locating vertebrae and using

the parameters of statistical appearance models to achieve this. In the next chapter

we review the model-based vision methods upon which our work is based.


Chapter 4

Model Based Vision

4.1 Introduction

Medical images often provide noisy and sometimes incomplete data, and typically deal

with complex and variable structures. Sometimes additional structures are overlaid,

for example soft tissue or ribs can overlie vertebrae in the thoracic spine. For such

reasons low-level feature detection processes such as edge detection alone tend to be

unreliable when applied to medical or other biological images, as there are invariably

gaps and spurious fragments. The most effective approaches explicitly describe the

shape of object boundaries and introduce constraints on these, whilst also allowing a

degree of flexibility, in order to accommodate shape variation. This class of methods

is known as Deformable Template methods.

Deformable Template methods are local optimisation schemes for locating required

contours in images, which typically represent the boundaries of objects. The various

schemes try to match a deformable model to an image by some form of local opti-

misation, typically of some form of “energy” functional. They all have an aspect of

incorporating prior knowledge about the kinds of allowable shape. Thus they have a

certain resistance to seduction by the noise or clutter typical of medical images, and

can “fill in” missing parts of the shape (e.g. where an edge is obscured or weak) by

a form of extrapolation.

Because the contours can deform to fit a wide variety of shapes, these methods are

often used in medical image analysis. This is because biological structures and organ-


isms are naturally deformable, and their shape varies considerably. During the search

the image data, desired contour properties (or prior knowledge of possible shapes),

and also constraints are incorporated. The combination of these three features into

a single search gives these algorithms a good chance of success, where for example a

purely data-driven search would be unreliable.

In this chapter we briefly review some early methods of Deformable Templates leading

up to the Active Shape Model of Cootes et al [27]. We then introduce the later

Appearance Models of Edwards et al [46, 24] and their fitting via the Active Appearance

Model [26].

4.2 Snakes

Snakes [83] incorporate a low degree of prior knowledge - typically a measure of

allowable local curvature and elasticity. The essential idea in the seminal paper by

Kass [83], was to take a form of feature map and treat it as a landscape in which a

“snake” (i.e. the contour) could slither around. Prior information on possible shapes

could be added in two ways. Kass recognised that the classical Euler equations for an

elastic string moving in an external field could be used in computer vision. The prior

information is relatively weak, and enforces smoothness on the snake by adding an

“internal” energy term, representing the potential energy stored in the snake due to

stretching and bending. The snake is described by means of a parameterised variable

s on [0,1]. The elastic energy adds terms in the first derivative w.r.t. s, whilst

bending energy increases with the second derivative (i.e. curvature is penalised).

Reducing the curvature coefficient β over a section of a snake allows a corner to more

easily develop there. The external energy depends on the image data. For example

when the snake is seeking a strong edge, an appropriate energy term to add is

$$E = -|\nabla I|^2 \qquad (4.1)$$

The equilibrium configuration is the contour of minimum total energy. As the

total energy is integrated over the snake this is a classical calculus of variations

problem, soluble via the Euler-Lagrange equations for the snake energy functional.
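A discrete version of the snake energy makes the roles of the terms concrete. The sketch below is a minimal illustration, assuming a closed contour sampled at unit parameter spacing with finite differences in place of derivatives; the elasticity coefficient is written alpha following the usual notation for Kass's snake, and `edge_strength_sq` is a hypothetical callable returning the squared image gradient magnitude at a point.

```python
def snake_energy(points, edge_strength_sq, alpha=1.0, beta=1.0):
    """Total energy of a closed snake given as a list of (x, y) points.

    Internal energy: 0.5 * (alpha * |v'|^2 + beta * |v''|^2), using
    finite differences for the first and second derivatives.
    External energy: -|grad I|^2, as in equation (4.1).
    """
    n = len(points)
    total = 0.0
    for k in range(n):
        xp, yp = points[k - 1]            # previous point (wraps around)
        x, y = points[k]
        xn, yn = points[(k + 1) % n]      # next point (wraps around)
        stretch = (xn - x) ** 2 + (yn - y) ** 2
        bend = (xp - 2 * x + xn) ** 2 + (yp - 2 * y + yn) ** 2
        total += 0.5 * (alpha * stretch + beta * bend)
        total -= edge_strength_sq(x, y)
    return total

# A unit square on a featureless image (no edges anywhere).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
flat = snake_energy(square, lambda x, y: 0.0)
no_bending_penalty = snake_energy(square, lambda x, y: 0.0, beta=0.0)
```

Setting beta to zero removes the penalty on the four corners of the square, which is the discrete counterpart of reducing β over a section of the snake to let a corner develop.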

Many methods have been proposed to improve the original snake of Kass. These typ-

ically involve adding additional terms to the energy functional. Cohen [22] proposed



an internal “inflation” force to expand a snake past spurious inner edges towards the

real edges of the sought structure, thus making the segmentation less sensitive to the

starting position. Poon [107] used simulated annealing to avoid entrapment by local

minima. Chakraborty and Duncan [19] introduced region-based information in order

to decrease seduction by insignificant edges. A review is given in [97].

4.2.1 Deformable Elliptical Models

Just as snake-type models have improved on simple image edge detection by enforcing

constraints on likely object boundaries, so the explicit incorporation of a priori knowl-

edge of shape variation into deformable models can enable the problem of matching

the model to the image to be further constrained. Staib and Duncan’s approach [128]

used stronger prior shape information than snakes, but it was somewhat less general,

requiring closed curves. It was applied to tracking heart motion on echocardiograms.

The model was based on the elliptical Fourier decomposition of the shape boundary,

and can be viewed as a linear combination of basis functions which are all ellipses, of

varying scale, orientation, and phase shift.

The high frequency content of the decomposition was discarded. For an application

fitting to echocardiograms of a heart ventricle, only the first four harmonics were

retained. This smoothes any segmentation derived as a model fit. By fitting the

model to an ensemble of shapes, a prior distribution can be derived for the model

parameters. Separate probability distributions were derived for each Fourier coeffi-

cient on the basis of the training set contours. But the probabilities of individual

Fourier coefficients are not necessarily independent over the training examples, with

the result that if two coefficients are strongly correlated, unrealistic combinations

of coefficients could still be considered likely by the model. The segmentation was

performed by using essentially a strongest edge search, but constrained by the prior

distribution of coefficients.
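The smoothing effect of discarding high harmonics can be illustrated with a truncated Fourier representation of a closed contour. Note that this sketch uses complex Fourier descriptors rather than the real elliptical Fourier coefficients of Staib and Duncan; we use it because each pair of harmonics ±k likewise traces an ellipse, and the function names are our own.

```python
import cmath
import math

def fourier_coefficients(contour, harmonics):
    """Complex Fourier coefficients c_k, k = -harmonics..harmonics, of a
    closed contour given as evenly sampled (x, y) points."""
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    return {
        k: sum(z[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)) / n
        for k in range(-harmonics, harmonics + 1)
    }

def reconstruct(coeffs, n_points):
    """Rebuild a smoothed contour from the retained harmonics only."""
    out = []
    for m in range(n_points):
        z = sum(c * cmath.exp(2j * cmath.pi * k * m / n_points)
                for k, c in coeffs.items())
        out.append((z.real, z.imag))
    return out

# An ellipse with semi-axes 3 and 2 is captured by harmonics +1 and -1 alone.
ellipse = [(3 * math.cos(2 * math.pi * m / 32), 2 * math.sin(2 * math.pi * m / 32))
           for m in range(32)]
coeffs = fourier_coefficients(ellipse, harmonics=4)
smooth = reconstruct(coeffs, 32)
```

For a noisy boundary the same truncation keeps only the coarse shape, and the retained coefficients are the parameters over which a prior distribution would be estimated.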

4.3 Elastic Models

Nastar and Ayache [104] also used a set of orthogonal basis functions, but these were

based on ideas in mechanics of the normal modes of vibration of a body. Their model



was not simply of a contour, as elastic coupling between non-adjacent parts creates

a model of a volume, not just its surface. This also stops model solutions potentially

collapsing, as a snake could be inclined to, but it is necessary to set up a detailed

physical model of the body (or area) in terms of masses connected by springs, in

the manner of finite element analysis. The authors put the Newtonian equations

for damped coupled harmonic oscillators in linear form, thus decomposable in terms

of eigenmodes. However there are a large number of essentially ad hoc parameters

controlling elasticity terms and pseudo-masses.

4.4 Active Shape Model (ASM)

4.4.1 Summary of ASM

In the Active Shape Model (ASM) of Cootes et al [27], the equivalent of the template

is a statistical shape model. This requires a training set of representative images.

Points of interest (e.g. significant anatomical landmarks, corners, and strong edges)

must be manually annotated. The annotation must of course be performed in a consistent

manner (i.e. the same points in different images need to correspond).

Once the images have been annotated, the training shapes are co-registered into

a common frame using Procrustes Analysis, to remove variation which can be

reduced to similarity transforms (translation, rotation, and (isotropic) scaling).

The basic insight of the ASM was that the distribution of all (co-registered) shapes

in the training set can be approximated by the first two moments of the joint dis-

tribution. So the mean (co-registered) shape is derived. Individual variation is then

expressed in terms of the residuals from the mean. Deviation from the mean is es-

sentially captured in the residual covariance matrix, as there is of course substantial

correlation between the variation in points. It is precisely this correlation - which

describes how points tend to move together - that implicitly captures the shape constraints

inherent in the training set. The covariance matrix is diagonalised, thus

performing Principal Components Analysis on the training set.

The model of all possible shapes of the required type is then simply the mean shape

plus a weighted sum of the first m principal components. Thus the model is linear,

and uses an orthogonal basis. Also the eigenvalues can provide bounds on attain-

able shapes, either singly; or by assuming a Gaussian model via a hyper-ellipsoid

whose axes are parameterised by the (square rooted) eigenvalues. Thus strong prior

constraints are easily associated with the Shape Model.

4.4.2 Point Distribution Models

The Shape Model used in the ASM is a form of Point Distribution Model (PDM). We

now summarise this in mathematical form. We deal only with the 2-dimensional case,

but the method can be generalised to shapes in 3 dimensions or more (sometimes a

temporal sequence is represented as a 4D shape).

Let $\mathbf{x}_i$ be a vector of length $2n$ describing the $n$ points of the $i$th shape of the training set, given by:

$$ \mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, \ldots, x_{i,2n-1}, x_{i,2n}) \tag{4.2} $$

where $(x_{i,2j-1}, x_{i,2j})$ is the Cartesian coordinate of the $j$th landmark point on the $i$th training example.

The training set shapes are then aligned to remove variability due to translation,

scaling, and rotation (i.e. similarity transforms). In principle more general alignment

can be done (e.g. using affine transforms to allow shearing). To align the set of N

training shapes, each shape is initially aligned with the first shape in the training set.

The mean shape of the training set is calculated, and all shapes are then aligned to the

mean, whereupon the mean is recalculated. The alignment and mean recalculation

are iterated until the alignment converges, using a summed square distance as the

alignment metric. Having found a scaling, translation and rotation to align each

training shape to the mean shape, aligned shapes are used in subsequent analysis.
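The iterative alignment described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: it assumes 2D shapes stored as lists of complex numbers (so that a single complex factor encodes scale and rotation), and the function names, the unit-size normalisation of the mean, and the convergence tolerance are all assumptions made for the example.

```python
# Sketch: generalised Procrustes alignment of 2D landmark shapes.
# Each shape is a list of complex numbers x + 1j*y, one per landmark.

def centre(shape):
    """Translate a shape so that its centroid is at the origin."""
    c = sum(shape) / len(shape)
    return [z - c for z in shape]

def normalise(shape):
    """Rescale a centred shape so that its summed squared size is 1."""
    size = sum(abs(z) ** 2 for z in shape) ** 0.5
    return [z / size for z in shape]

def align_to(shape, target):
    """Least-squares scale and rotation of a centred shape onto a centred target."""
    num = sum(t * z.conjugate() for z, t in zip(shape, target))
    den = sum(abs(z) ** 2 for z in shape)
    a = num / den  # one complex number encodes both scale and rotation
    return [a * z for z in shape]

def mean_shape(shapes):
    """Point-wise mean of a list of shapes."""
    return [sum(points) / len(shapes) for points in zip(*shapes)]

def generalised_procrustes(shapes, tol=1e-12, max_iter=50):
    """Align all shapes to their evolving mean; returns (aligned shapes, mean)."""
    shapes = [centre(s) for s in shapes]
    ref = normalise(shapes[0])                 # the first shape fixes the frame
    aligned = [align_to(s, ref) for s in shapes]
    mean = normalise(mean_shape(aligned))
    for _ in range(max_iter):
        aligned = [align_to(s, mean) for s in aligned]
        # Re-anchor the new mean to the reference so scale and rotation cannot drift.
        new_mean = normalise(align_to(mean_shape(aligned), ref))
        change = sum(abs(a - b) ** 2 for a, b in zip(new_mean, mean))
        mean = new_mean
        if change < tol:
            break
    return [align_to(s, mean) for s in aligned], mean
```

Re-anchoring the mean to the first shape at every iteration is one simple way of preventing the mean from shrinking or rotating as the iteration proceeds; other conventions are possible.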

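First the mean aligned shape is calculated, and then the deviations about that mean are evaluated, whence the covariance matrix is computed. Mathematically, having obtained $N$ aligned shapes $\mathbf{x}_1 \ldots \mathbf{x}_N$, and calculated the mean shape

$$ \bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \tag{4.3} $$

the deviations of each example from the mean are calculated, given by

$$ \mathbf{d}_i = \mathbf{x}_i - \bar{\mathbf{x}} \tag{4.4} $$

Then the covariance matrix $\mathbf{C}$ is given (up to a constant normalising factor such as $1/N$) by

$$ \mathbf{C} = \mathbf{d}\mathbf{d}^T \tag{4.5} $$

where $\mathbf{d}$ is the $2n \times N$ matrix whose columns are the deviation vectors $\mathbf{d}_i$.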

Principal component analysis (PCA) [93] is then applied to these deviations, effectively fitting a $2n$-D ellipsoid to the distribution of the training vectors $\mathbf{d}_i$ in the $2n$-D space. This is done by calculating the $2n$ eigenvectors $\mathbf{p}_k$ ($k = 1, \ldots, 2n$) of $\mathbf{C}$, corresponding to the principal axes of the ellipsoid, with their corresponding eigenvalues $\lambda_k$ such that

$$ \mathbf{C}\mathbf{p}_k = \lambda_k \mathbf{p}_k \tag{4.6} $$

and $\lambda_k \ge \lambda_{k+1}$. The eigenvectors with the largest eigenvalues correspond to the axes

that describe the largest variation in the shapes. The eigenvectors are orthogonal,

so the projections of the training data onto them are uncorrelated, and each can be thought of as an independent mode of

variation. Each eigenvalue gives the amount of residual variance associated with

the corresponding mode. Furthermore the dimensionality can often be substantially

reduced by retaining only the most significant components (i.e. the ones with the

largest eigenvalues). For example to retain 98% of the original variance it is necessary

to choose the cut-off at the smallest number of modes $m$ such that

$$ \sum_{k=1}^{m} \lambda_k \ge 0.98\,\mathrm{Tr}(\mathbf{C}) \tag{4.7} $$
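As a small illustration of this cut-off rule, the following Python sketch returns the smallest number of modes whose eigenvalues account for a requested fraction of the total variance. The function name and the example eigenvalues are hypothetical; the only fact used is that the trace of the covariance matrix equals the sum of its eigenvalues.

```python
def modes_to_retain(eigenvalues, fraction=0.98):
    """Smallest m such that the m largest eigenvalues explain `fraction` of the variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)  # equals Tr(C)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running >= fraction * total:
            return m
    return len(lams)

# Hypothetical eigenvalue spectrum, used purely for illustration.
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
m_85 = modes_to_retain(example_eigenvalues, 0.85)
```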

There are many methods of diagonalising $\mathbf{C}$, but one of the most numerically robust is to use Singular Value Decomposition (SVD) [108]. One problem can be that there are insufficient training examples to be able to obtain a matrix of full rank, but by transposing the problem it is possible to derive the $N$ (training set size) eigenvectors of the $N$ by $N$ pseudo-covariance matrix:

$$ \mathbf{C}' = \mathbf{d}^T\mathbf{d} \tag{4.8} $$

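It can be shown that the eigenvalues of $\mathbf{C}'$ are also eigenvalues of the (degenerate) $\mathbf{C}$. Also given the eigenvectors $\mathbf{P}'$ of $\mathbf{C}'$, the reduced set of $N$ eigenvectors $\mathbf{P}^{(N)}$ of $\mathbf{C}$ for the non-zero set of eigenvalues is given (before rescaling each column to unit length) by

$$ \mathbf{P}^{(N)} = \mathbf{d}\mathbf{P}' \tag{4.9} $$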

SVD is a very robust method when smaller-than-ideal training set sizes lead to C

being not positive definite, and can be applied as described above to C′ in this

situation.
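The relationship between the eigen-decompositions of the two matrices follows in a few lines. The short derivation below is a sketch under the assumption that $\mathbf{d}$ denotes the $2n \times N$ matrix of deviation vectors, so that $\mathbf{C} = \mathbf{d}\mathbf{d}^T$ and $\mathbf{C}' = \mathbf{d}^T\mathbf{d}$, and that $\mathbf{p}'$ is a unit-length eigenvector of $\mathbf{C}'$ with eigenvalue $\lambda$.

```latex
\begin{align*}
\mathbf{C}'\mathbf{p}' &= \lambda\,\mathbf{p}' \\
\mathbf{d}\,\mathbf{d}^T\mathbf{d}\,\mathbf{p}' &= \lambda\,\mathbf{d}\,\mathbf{p}'
  && \text{(pre-multiplying by } \mathbf{d}\text{)} \\
\mathbf{C}\,(\mathbf{d}\mathbf{p}') &= \lambda\,(\mathbf{d}\mathbf{p}')
  && \text{(so } \mathbf{d}\mathbf{p}' \text{ is an eigenvector of } \mathbf{C}\text{)} \\
\|\mathbf{d}\mathbf{p}'\|^2 &= \mathbf{p}'^T\mathbf{C}'\mathbf{p}' = \lambda
  && \text{(so } \mathbf{d}\mathbf{p}' \neq \mathbf{0} \text{ whenever } \lambda > 0\text{)} \\
\mathbf{p} &= \frac{\mathbf{d}\,\mathbf{p}'}{\sqrt{\lambda}}
  && \text{(unit-length eigenvector of } \mathbf{C}\text{)}
\end{align*}
```

Hence every non-zero eigenvalue of $\mathbf{C}'$ is also an eigenvalue of $\mathbf{C}$, and the corresponding eigenvector of $\mathbf{C}$ is obtained by mapping $\mathbf{p}'$ through $\mathbf{d}$ and rescaling.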

By taking the first $m$ eigenvectors in order of decreasing eigenvalue (i.e. most variance explained first), the principal shape variation observed can be described by a $2n \times m$ matrix $\mathbf{P}_s$, containing only these eigenvectors:

$$ \mathbf{P}_s = (\mathbf{p}_1\ \mathbf{p}_2\ \ldots\ \mathbf{p}_m) \tag{4.10} $$

Any shape $\mathbf{x}$ similar to those in the training set, and allowable by the modes of variation given by the principal axes of the hyper-ellipsoid can be generated using

$$ \mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}_s\mathbf{b} \tag{4.11} $$

by varying a vector of weights $\mathbf{b} = (b_1\ b_2\ \ldots\ b_m)^T$. These are known as shape parameters, and completely describe the range of possible shapes allowable by the PDM. Given another shape $\mathbf{x}'$ within the subspace defined by the set of eigenvectors, the required shape parameters $\mathbf{b}'$ are given by

$$ \mathbf{b}' = \mathbf{P}_s^T(\mathbf{x}' - \bar{\mathbf{x}}), \tag{4.12} $$

since the columns of $\mathbf{P}_s$ are orthonormal, as the eigenvectors of a symmetric matrix can always be chosen to be orthogonal.
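The generation and projection steps can be illustrated with a toy model. The sketch below is not taken from any particular library: the function names are hypothetical, the mode matrix is stored as a plain list of $2n$ rows with $m$ entries each, and the two-landmark "model" is invented purely so that the arithmetic can be checked by hand.

```python
def generate_shape(mean, P, b):
    """x = mean + P b, with P stored as a list of 2n rows of m entries."""
    return [m_j + sum(p * bk for p, bk in zip(row, b)) for m_j, row in zip(mean, P)]

def project_shape(mean, P, x):
    """b = P^T (x - mean); valid because the columns of P are orthonormal."""
    diff = [xj - mj for xj, mj in zip(x, mean)]
    n_modes = len(P[0])
    return [sum(P[j][k] * diff[j] for j in range(len(diff))) for k in range(n_modes)]

# Toy model: two landmarks (2n = 4) and two orthonormal modes (m = 2).
toy_mean = [0.0, 0.0, 1.0, 1.0]
toy_P = [[0.5, 0.5],
         [0.5, -0.5],
         [0.5, 0.5],
         [0.5, -0.5]]

toy_shape = generate_shape(toy_mean, toy_P, [2.0, -1.0])
recovered = project_shape(toy_mean, toy_P, toy_shape)

# Adding a component orthogonal to both modes leaves the projection unchanged,
# which is the least-squares behaviour for shapes outside the model subspace.
off_subspace = [x + 3.0 * q for x, q in zip(toy_shape, [0.5, 0.5, -0.5, -0.5])]
recovered_off = project_shape(toy_mean, toy_P, off_subspace)
```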

Even if $\mathbf{x}'$ does not lie within the subspace, the above solution for $\mathbf{b}'$ is still optimal in a least squares sense. When applying shape models the parameters are normally additionally constrained in that the elements $b_k$ of $\mathbf{b}$ are kept within some limits, such as between $-3\sqrt{\lambda_k}$ and $+3\sqrt{\lambda_k}$, where $\lambda_k$ is the $k$th eigenvalue, and hence $\sqrt{\lambda_k}$ is the standard deviation of $b_k$ over the training set. This would define a hyper-cuboid in the shape-space. This can still give rise to implausible shapes if all parameters are close to their extrema, so a better method is to constrain to a hyper-ellipsoid by limiting the Mahalanobis distance of $\mathbf{b}'$. The orthogonality of the Principal Components means that covariance terms are zero, and the Mahalanobis distance $D$ is given by:

$$ D^2 = \sum_{i=1}^{m} \frac{b_i'^2}{\lambda_i} \tag{4.13} $$

Figure 4.1: Spine shape model variation mode 1

Under the assumption that the distribution of shape parameters over possible valid

shapes is Gaussian, $D^2$ is, up to an additive constant, proportional to the negative log-likelihood of this distribution. So to

prevent an unlikely configuration, a limit can be placed upon the maximum Maha-

lanobis distance allowed.
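The two forms of constraint can be contrasted in a short Python sketch. The function names and the limit of 3 are assumptions for illustration. Note also that pulling an out-of-bounds parameter vector back by uniform rescaling towards the mean is a common simplification; it places the result on the surface of the hyper-ellipsoid but not necessarily at the closest point on it.

```python
import math

def clamp_to_box(b, eigenvalues, k=3.0):
    """Hyper-cuboid constraint: each b_i is kept within +/- k standard deviations."""
    limits = [k * math.sqrt(lam) for lam in eigenvalues]
    return [max(-lim, min(lim, bi)) for bi, lim in zip(b, limits)]

def mahalanobis(b, eigenvalues):
    """Mahalanobis distance D for uncorrelated parameters with variances lambda_i."""
    return math.sqrt(sum(bi * bi / lam for bi, lam in zip(b, eigenvalues)))

def constrain_to_ellipsoid(b, eigenvalues, d_max=3.0):
    """Hyper-ellipsoid constraint: rescale b towards the mean if D exceeds d_max."""
    d = mahalanobis(b, eigenvalues)
    if d <= d_max:
        return list(b)
    scale = d_max / d
    return [bi * scale for bi in b]
```

For example, with eigenvalues (4, 1) the vector (6, 3) satisfies the hyper-cuboid limits exactly, yet its Mahalanobis distance is about 4.24, so the hyper-ellipsoid constraint would still pull it in.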

As an example of a shape model, Figures 4.1 and 4.2 show the change in shape

for the vertebrae from L4 to T7 as the first and second shape parameters of a shape

model of the spine vary through two standard deviations either side of zero.

Figure 4.2: Spine shape model variation mode 2

4.4.3 Active Shape Model Search

4.4.3.1 Summary of ASM Search

The Active Shape Model is a method of searching for the best set of pose parameters

and (constrained) shape model weights to generate an allowed shape which best

matches the image evidence. In order to have a criterion of best match, it is necessary

to supplement the shape model described above with a texture model. This is a set

of models (one per shape point) which models the grey level texture around each

point. The ASM proceeds by locally moving points along normals to the current

shape boundary, to locate the best profile match for each point. Then this set of

points is fitted to the constraining shape model. We now define the algorithm in

more detail.

4.4.3.2 Modelling Image Grey Levels

The ASM models the appearance of structures around the object boundaries using

a profile model of the image brightness perpendicular to the boundary. Typically

the brightness levels in nearby pixels will be correlated, and so again to model the

expected profile observed over the training set at a particular landmark point, a

Principal Component model (similar to that used for PDMs) is constructed.

Rather as the shapes were co-aligned using a similarity transform, each of the grey-

level profiles can be co-aligned over the training set by using a linear transform (offset

and linear scaling) to model gross variation in brightness and contrast. Each pixel’s

grey-level model is constructed after this normalisation.

Having performed PCA on the training profiles, and selected the number of modes

required to explain most of the variance between training profiles, the jth profile can

be modelled as

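$$ \mathbf{g}_j = \bar{\mathbf{g}}_j + \mathbf{P}_{tj}\mathbf{b}_j \tag{4.14} $$

where $\bar{\mathbf{g}}_j$ is the mean profile for model point $j$, $\mathbf{P}_{tj}$ the truncated matrix of orthogonal eigenvectors of the profile covariance matrix, and $\mathbf{b}_j$ a vector of weights.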

It is necessary to define a measure of the quality of match between the image evidence

and the profile model.

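In fitting the model to the profile $\mathbf{g}$, $\mathbf{b}_t = \mathbf{P}_{tj}^T(\mathbf{g} - \bar{\mathbf{g}})$ is the parameter vector used to describe the profile. $\mathbf{g}_{fitted} = \bar{\mathbf{g}} + \mathbf{P}_{tj}\mathbf{b}_t$ is the closest approximation of the model to the profile. The residual $\mathbf{r}$ is given by $\mathbf{r} = \mathbf{g} - \mathbf{g}_{fitted}$. The fitness measure is given (from [28]) by

$$ f = \sum_{i=1}^{m_t} \frac{b_i^2}{\mu_i} + \sum_{k=1}^{n} \frac{r_k^2}{v_k} \tag{4.15} $$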

where µi is the eigenvalue of mode i and vk describes how well the kth profile point

was modelled, and is the mean variance of each training profile from the best fit of a

model trained from the remaining training profiles. So this ‘fitness’ measure describes

the extent to which a sampled profile in the image differs from the profile model in

both ways observed in training (i.e. Mahalanobis distance - see Section 4.4.2); and

in ways which are not explicable by the training data (residual variance). The second

term of the measure describes the residual deviations of the sampled profile from

the model, relative to the model’s residual variance over the training set. Arguably

these variances should ideally be estimated using jackknifed leave-one-out train/refit

schemes; but in practice it is usual to use a somewhat biased estimate by using the

refitting residual variance over the model’s own training set.
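To make the two terms of the fitness measure concrete, the following Python sketch evaluates it for a single sampled profile. It is a minimal illustration with hypothetical names: the mode matrix is stored as a list of rows (one per profile pixel), and the per-pixel residual variances are simply passed in, however they were estimated.

```python
def profile_fitness(g, g_mean, P, mode_variances, residual_variances):
    """Mahalanobis term within the model plus a variance-weighted residual term."""
    n, n_modes = len(g), len(P[0])
    diff = [gk - mk for gk, mk in zip(g, g_mean)]
    # b = P^T (g - g_mean): parameters of the best model fit to the profile
    b = [sum(P[k][i] * diff[k] for k in range(n)) for i in range(n_modes)]
    # r = g - g_fitted: the part of the profile the model cannot explain
    fitted = [g_mean[k] + sum(P[k][i] * b[i] for i in range(n_modes)) for k in range(n)]
    r = [gk - fk for gk, fk in zip(g, fitted)]
    in_model = sum(bi * bi / mu for bi, mu in zip(b, mode_variances))
    out_of_model = sum(rk * rk / vk for rk, vk in zip(r, residual_variances))
    return in_model + out_of_model

# Toy example: a 3-pixel profile model with one mode along the first pixel.
f_example = profile_fitness(
    g=[2.0, 1.0, 0.0],
    g_mean=[0.0, 0.0, 0.0],
    P=[[1.0], [0.0], [0.0]],
    mode_variances=[4.0],
    residual_variances=[1.0, 0.25, 1.0],
)
```

In the toy example the first pixel's deviation is explained by the mode (contributing 4/4 = 1), while the second pixel's deviation lies outside the model (contributing 1/0.25 = 4).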

4.4.3.3 Combining Grey Level and Shape in Search

For each model point the ASM searches along the local normal over some search

distance to locate the point which allows texture model parameters to best match

the local texture model to the sampled grey level texture. At each iteration a set of

best new points are thus obtained. These are located in a manner which is greedy

on a per-point basis. The overall shape must still conform to the shape model, so

next the best overall shape model fit to the new point set is used. In effect this

also smooths the shape. Also the allowed shapes are bounded by the shape model

hyper-ellipsoid; and if the least squares fit falls outside the bound, the parameters

are adjusted so the fit lies on the closest point of the allowed volume. This gives the

ASM good robustness in the presence of noise. The ASM search continues to iterate

in this manner until the solution converges.

Local minima in the ASM search objective function can be reduced by smoothing

the objective function by considering the image at multiple scales. This has the

additional advantage of enabling profiles to be searched for over longer distances.

During training, a pyramid of images of multiple resolutions is created. These are

generated by using Gaussian smoothing, followed by sub-sampling - normally in as-

cending multiples of 2. Separate grey-level profile models are generated for each level

of the multi-resolution pyramid by sampling from the appropriate subsampled im-

age, instead of the original full resolution image. This process is identical to that for

training a conventional local grey-level model, but using a lower resolution image,

and scaling the positions of the landmark points appropriately to the smaller image.

The local grey-level profiles are sampled over the same number of pixels as before,

making them effectively longer when viewed in the original resolution image.

To perform multi-resolution image search, conventional search is started in a reduced

resolution image. When search has converged in that image, the shape model is

projected into a higher resolution image, where search continues. This process repeats

until search has converged in the full resolution image.
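The pyramid construction and the coarse-to-fine loop can be sketched as follows. This is an illustrative sketch only: the 5-tap binomial kernel is an assumed stand-in for the Gaussian smoothing, images are plain lists of rows, and `search_at_level` is a hypothetical callable standing in for a single-resolution search that returns refined point positions.

```python
# 5-tap binomial kernel, assumed here as an approximation to a Gaussian.
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)

def smooth_rows(img):
    """Smooth each row, clamping indices at the image border."""
    out = []
    for row in img:
        n = len(row)
        out.append([
            sum(w * row[min(max(x + k - 2, 0), n - 1)] for k, w in enumerate(KERNEL))
            for x in range(n)
        ])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def reduce_level(img):
    """Smooth in both directions, then keep every second pixel."""
    smoothed = transpose(smooth_rows(transpose(smooth_rows(img))))
    return [row[::2] for row in smoothed[::2]]

def build_pyramid(img, levels):
    """Level 0 is the full-resolution image; each further level halves the size."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid

def coarse_to_fine(pyramid, points, search_at_level):
    """Run the (hypothetical) single-level search from coarsest to finest level.

    `points` are (x, y) tuples in full-resolution coordinates.
    """
    for level in range(len(pyramid) - 1, -1, -1):
        scale = 2 ** level
        scaled = [(x / scale, y / scale) for x, y in points]
        refined = search_at_level(pyramid[level], scaled)
        points = [(x * scale, y * scale) for x, y in refined]
    return points
```

Because the same number of profile pixels is sampled at every level, a profile at level 2 spans four times the distance in the original image, which is what extends the search range.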

4.5 Active Appearance Models

4.5.1 Background to the Active Appearance Model

There are several weaknesses in the way that the ASM treats shape and texture.

Firstly separate texture models are generated for every point, but in practice texture

at points close to each other will be correlated. Secondly there will be correlations

between shape and texture which are not modelled - the degree of plausibility of a

particular set of textures might depend on the shape, but the ASM has an in-built

bias to try and locate points that appear to have a surrounding texture close to the

mean. Thirdly there is a degree of local greediness about the way the ASM initially

moves each point separately to its own local optimum, before applying the shape

model constraints. Fourthly, because only profile models are considered, no use is made

of texture information that lies within the interior of the shape but is not intersected by a

profile.

Active Appearance Models arose as a way of overcoming these shortcomings, and

also as extensions of earlier work on Eigenfaces [135] and Active Blobs [121]. In the

Eigenface approach of Turk and Pentland, texture is modelled by extracting image

texture from a window containing a face, and then projecting the texture vector

into a subspace obtained by performing PCA over a training set of such windowed

faces. Rather than using a simple window, the AAM extracts texture from within

the convex hull of a shape described using a Shape Model, and also involves warping

to try and obtain a “shape-free” texture model.

4.5.2 Appearance Models

There are a number of AAM variants. We describe first the Appearance Model of the

original “classical” AAM. Unlike the ASM this samples texture from within the entire

region of the shape (actually its convex hull). Because the shape varies, the AAM

tries to define a shape-free patch from which to sample image texture. This is done

by using a triangulation mesh between the points of the shape model, that allows the

image patch to be covered by triangles. Each triangle is then affine warped so that

the three control points defining the triangle are in their mean positions. The process

of warping all the triangular segments warps the whole image patch to that defined by

the mean shape. An input to the AAM building process is the total number of pixels

to be included in the model. In effect this defines how many points are taken from

each triangle in the mesh. After warping to the mean shape each training image has

this set of sample points in correspondence, and the pixel values at all these points

are concatenated into one overall texture vector gi (for training image i). The texture

vectors are co-aligned for gross brightness and contrast variation using a linear shift

and scaling to the mean in a similar manner to the Procrustes alignment of shapes.
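The brightness and contrast normalisation of a texture vector can be illustrated with a simplified Python sketch. It assumes the simplest form of the linear normalisation, in which each vector is shifted to zero mean and scaled to unit variance; aligning iteratively to an evolving mean texture, as in Procrustes alignment, is a refinement not shown here. The function name is hypothetical.

```python
def normalise_texture(g):
    """Remove gross brightness (offset) and contrast (scale) from a texture vector."""
    n = len(g)
    beta = sum(g) / n                       # brightness offset
    centred = [v - beta for v in g]
    alpha = (sum(v * v for v in centred) / n) ** 0.5   # contrast scale
    if alpha == 0:
        raise ValueError("texture vector is constant; contrast is undefined")
    return [v / alpha for v in centred]

# Two samplings of the same texture under different brightness and contrast
# normalise to (numerically) the same vector.
texture = [10.0, 20.0, 30.0]
relit = [2.0 * v + 5.0 for v in texture]
norm_a = normalise_texture(texture)
norm_b = normalise_texture(relit)
```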

Again, just as for the shape model, PCA is performed on the aligned texture vector

covariance matrix. This results in a texture model of the form:

$$ \mathbf{g} = \bar{\mathbf{g}} + \mathbf{P}_t\mathbf{b}_t + \mathbf{r} \tag{4.16} $$

The next step in the appearance model building is to perform a tertiary PCA on

both the shape and texture parameters. Firstly in each training example the resul-

tant shape and texture parameters are concatenated into a single combined vector,

but with a rescaling weight $w$ on the shape components to bring them into a commensurate scale with the texture parameters. Thus we have combined vectors $\mathbf{b}^{(a)}_i$ with:

$$ \mathbf{b}^{(a)}_i = (w b_{s1}, w b_{s2}, \ldots, w b_{s m_s}, b_{t1}, b_{t2}, \ldots, b_{t m_t})^T \tag{4.17} $$

The scaling w is typically chosen so that shape and texture parameters contribute

an equal variance to the total variance of the combined vector parameters. Then to

model shape-texture correlation PCA is performed again on the combined vectors to

arrive at a compact “appearance model”. Typically some significant dimensionality

reduction can be obtained by retaining only 99% of the overall variance. The new

matrix of orthogonal eigenvectors is designated as Q, and we then have the linear

model

$$ \mathbf{b}^{(a)} = \bar{\mathbf{b}}^{(a)} + \mathbf{Q}\mathbf{c} + \mathbf{r}_a \tag{4.18} $$

where c are the appearance parameters. Constraints are applied to the appearance

parameters in a similar way to the shape model parameters. These implicitly con-

strain both shape and texture, but are somewhat stricter than constraints on either

alone, in that correlation between the two has been modelled. So certain forms of

texture can only occur with certain shape deformations.
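The choice of the shape weight and the construction of the combined vector can be sketched as below. This assumes the equal-variance rule described above, so that $w^2$ times the total shape variance equals the total texture variance; the function names and the example eigenvalues are hypothetical.

```python
import math

def shape_weight(shape_eigenvalues, texture_eigenvalues):
    """Weight w such that w^2 * (total shape variance) = total texture variance."""
    return math.sqrt(sum(texture_eigenvalues) / sum(shape_eigenvalues))

def combined_vector(b_shape, b_texture, w):
    """Concatenate the weighted shape parameters with the texture parameters."""
    return [w * b for b in b_shape] + list(b_texture)

# Illustrative eigenvalues: total shape variance 5, total texture variance 45.
w_example = shape_weight([4.0, 1.0], [30.0, 10.0, 5.0])
b_example = combined_vector([1.0, 2.0], [7.0, 8.0, 9.0], w_example)
```

The final PCA is then performed on the set of such combined vectors, one per training example.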

Figure 4.3: Spine appearance model variation mode 1

Figures 4.3 to 4.6 show the resultant variation in the appearance of the spine as

the first four appearance parameters vary by 2.5 standard deviations about zero.

Interestingly the second mode appears to be largely a gross variation in brightness

across the image.

Later in this thesis we discuss how we fit together appearance models of smaller

subsections of the spine, which use triplets of vertebrae. Figures 4.7 and 4.8 show

the resultant variation in the appearance of the L1-centred appearance model as the

first two appearance parameters vary by 2.5 standard deviations about zero.

Note that the linear nature of the models allows us to express the shape and grey-levels directly as functions of $\mathbf{c}$. Firstly we row-decompose $\mathbf{Q}$ into the $m_s$ shape parameter rows and $m_t$ texture parameter rows, so

$$ \mathbf{Q} = \begin{pmatrix} \mathbf{Q}_{cs} \\ \mathbf{Q}_{ct} \end{pmatrix} \tag{4.19} $$

Figure 4.7: L1 triplet appearance model variation mode 1

Figure 4.8: L1 triplet appearance model variation mode 2

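It is possible to decompose equation 4.18 to obtain the model frame shape or texture as:

$$ \mathbf{x} = \bar{\mathbf{x}} + \mathbf{Q}_s\mathbf{c}, \qquad \mathbf{g} = \bar{\mathbf{g}} + \mathbf{Q}_t\mathbf{c} \tag{4.20} $$

where, with $\mathbf{I}_m$ denoting an $m$-dimensional identity matrix, the sub-matrices are given by

$$ \mathbf{W}_s = w\mathbf{I}_{m_s}, \qquad \mathbf{Q}_s = \mathbf{P}_s\mathbf{W}_s^{-1}\mathbf{Q}_{cs}, \qquad \mathbf{Q}_t = \mathbf{P}_t\mathbf{Q}_{ct} \tag{4.21} $$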

Figure 4.9: L1 triplet profile gradient appearance model variation mode 1

As well as texture models that sample within the shape’s convex hull using a triangulated

mesh, it is also possible to use profiles in an ASM-like manner. However,

unlike the ASM, all the profiles are concatenated together into a single texture vector.

Using profiles like this removes one advantage of AAMs, namely the warping of the

patch to a shape-free control patch. On the other hand by sampling further outside

the shape region, the convergence zone of AAM search can be extended. A degree

of shape normalisation can be introduced by scaling the profile length in the ratio of

the overall shape size (e.g. r.m.s. distance to centroid) to that of the mean shape.

Figures 4.9 and 4.10 show the variation in the first two appearance modes of a

profile model for an L1-centred triplet of vertebrae. The gradient along the profile

has been sampled and then sigmoidally renormalised to the local image statistics in

a manner discussed later in section 4.5.5.

4.5.3 Fitting Appearance Models

Essentially the goal of fitting an appearance model to a given image is to find a

set of shape pose parameters, texture scaling parameters, and appearance model

parameters c that best match the synthesised texture to the sampled texture. The

Active Appearance Model (AAM) is a method of fitting appearance model parameters

to best match the image evidence, given the implicit model constraints. One of the

strengths of AAMs is that by using a sum-of-squares measure to compare model and

target, they can exploit the linear nature of the problem to perform fast parameter

updates and thus are able to match to a new image very quickly.

Figure 4.10: L1 triplet profile gradient appearance model variation mode 2

The AAM fitting scheme shares some common ideas with the Active Blobs of Sclaroff and Isidoro [121],

namely that the texture should be projected into a fixed shape space, and that the

current residuals should drive updates to the model parameters via a linear model. In

the Active Blobs method the update matrix was derived from a single image with a

shape model based on elastic vibration modes (like Nastar and Ayache [104]), and the

texture model involved only planar lighting changes; whereas the AAM incorporates

more general variability over an ensemble of training images. The key insight of a

linear relationship between residuals and parameter updates is retained. So the AAM

learns from the training set not only the appearance model itself, but also how to fit

that model.

Firstly the shape pose parameters t (e.g. translation, scaling and rotation) and

texture scaling parameters u are concatenated with the appearance model parameters

c to give the full set of AAM parameters, denoted by p. Thus

$$ \mathbf{p}^T = \left(\mathbf{c}^T \,|\, \mathbf{t}^T \,|\, \mathbf{u}^T\right) \tag{4.22} $$

The AAM seeks to minimise a sum-of-squares problem of the form

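$$ F(\mathbf{p}) = |\mathbf{r}(\mathbf{p})|^2 = \mathbf{r}^T\mathbf{r} \tag{4.23} $$

where the vector of residuals $\mathbf{r}$ is calculated as:

$$ \mathbf{r} = \mathbf{w}(I : \mathbf{p}) - \mathbf{g} \tag{4.24} $$

where $\mathbf{w}(I : \mathbf{p})$ is the (normalised) texture sampled from the image $I$ given the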

shape defined by model parameters p. The AAM is trained (see [26]) how to solve

this minimisation by learning the relationship between perturbations in parameters

and the texture residuals that these induce. This relationship is then inverted to

provide an update matrix that can be applied to the current residual vector. When

applied iteratively this provides an efficient model fitting scheme where the necessary

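parameter change $\delta\mathbf{p}$ to the parameter vector $\mathbf{p}$, given $\mathbf{r}$, is estimated as:

$$ \delta\mathbf{p} = -\mathbf{R}\mathbf{r} \tag{4.25} $$

where $\mathbf{R}$ is derived from the pseudo-inverse of the Jacobian $\mathbf{J} = \partial\mathbf{r}/\partial\mathbf{p}$ thus:

$$ \mathbf{R} = \left[\mathbf{J}^T\mathbf{J}\right]^{-1}\mathbf{J}^T \tag{4.26} $$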

The Jacobian is learnt off-line by first fitting the appearance model to each training

example in turn, and then systematically perturbing each AAM parameter through

a series of steps from its correct value. The induced residuals then have a Gaussian

smoothing kernel applied to them, and by averaging over the training set a numerical

approximation to the Jacobian is derived.

Of course the linear relationship 4.25 may break down, particularly if the residuals

are large (i.e. outside the training region of J). In this case the update step may

actually increase the overall residual error. Therefore a simple form of line-search is

applied. Instead of applying an update δp the update αδp is applied, with α = 1/2.

If the error is still worse α is halved again, until either some improvement results,

or a minimum value of α is attained, in which case the AAM is assumed to have

converged.
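The update rule and its step-halving safeguard can be sketched in Python as follows. This is a schematic illustration rather than the implementation used in this work: `residual` is a hypothetical callable returning the residual vector for a given parameter vector, the update matrix is a plain list of rows, and the minimum step size of 1/16 is an assumed value.

```python
def sum_of_squares(r):
    return sum(v * v for v in r)

def aam_step(p, residual, R, alpha_min=1 / 16):
    """One AAM iteration: predict dp = -R r, then halve the step until the error drops."""
    r = residual(p)
    error = sum_of_squares(r)
    dp = [-sum(R_ij * r_j for R_ij, r_j in zip(row, r)) for row in R]
    alpha = 1.0
    while alpha >= alpha_min:
        trial = [p_i + alpha * d_i for p_i, d_i in zip(p, dp)]
        trial_error = sum_of_squares(residual(trial))
        if trial_error < error:
            return trial, trial_error, True
        alpha *= 0.5
    return p, error, False   # no improving step found: treat as converged

def aam_search(p, residual, R, max_iter=30):
    """Iterate until no step improves the error or max_iter is reached."""
    for _ in range(max_iter):
        p, error, improved = aam_step(p, residual, R)
        if not improved:
            break
    return p

# Toy linear problem: the residual is zero at p = (2, -1).
def toy_residual(p):
    return [p[0] - 2.0, p[1] + 1.0, 0.0]

exact_R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # pseudo-inverse of the toy Jacobian
overshooting_R = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # deliberately too aggressive

p_exact = aam_search([0.0, 0.0], toy_residual, exact_R)
p_damped = aam_search([0.0, 0.0], toy_residual, overshooting_R)
```

With the exact update matrix the toy problem is solved in one step; with the deliberately overshooting matrix the full step makes the error worse, so the halved step is taken instead and the search still converges.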

Like the ASM, AAMs are typically used with coarse-to-fine search using Gaussian

image pyramids (usually separated by a scale factor of two). A multi-resolution AAM

starts from a coarse scale first, using a heavily smoothed image, and a sub-sampled

texture patch. Normally it will quickly find an optimal though imprecise fit. A few

iterations at each of the higher resolution layers refine the fit. Starting at the coarse

up-smoothed scale tends to avoid entrapment by local minima, and increases the

convergence zone of the AAM.

4.5.4 Initialisation

AAMs typically require an initialisation reasonably close (in some sense) to the sought

object, since they perform local search. A small set of approximate initialisation

points may be provided by a user interaction (e.g. clicking the mouse at a central

position); or from some higher level image understanding process. For example Howe

et al [77] use an initial search with a Generalised Hough Transform using a template

given by the mean shape model. The best matches in pose are used to initialise an

AAM.

4.5.5 Extensions to the AAM

4.5.5.1 Non-Linear renormalisation and Feature Appearance Models

Appearance models are not limited to simple grey level texture. It is also possible

to build appearance models of various feature measures. Typically these use a non-

linear renormalisation that makes subsequent AAM search more robust to changes

in contrast and lighting effects. Cootes [32] used Sobel-filtered gradients with the

sigmoidal renormalisation:

$$ g' = \frac{g}{|g| + \overline{|g|}} \tag{4.27} $$

where $\overline{|g|}$ is the mean gradient magnitude.
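A sigmoidal renormalisation of this kind can be sketched in Python as below. The sketch assumes that the normalising constant in the denominator is the mean absolute gradient over the sampled values; that choice, and the function name, are assumptions for illustration.

```python
def sigmoid_renormalise(gradients):
    """Map each gradient g to g / (|g| + g0), with g0 the mean absolute gradient."""
    g0 = sum(abs(g) for g in gradients) / len(gradients)
    if g0 == 0:
        return [0.0 for _ in gradients]   # a flat region carries no edge evidence
    return [g / (abs(g) + g0) for g in gradients]

# The outputs lie strictly between -1 and 1, so a single very strong edge
# cannot dominate a least-squares fit.
renormalised = sigmoid_renormalise([-3.0, 0.0, 1.0, 8.0])
```

A gradient equal in magnitude to the local mean maps to plus or minus 0.5, which makes the measure insensitive to overall changes in image contrast.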

Bosch et al [15] used a renormalisation to transform the asymmetric pixel-intensity

distribution of ultrasound cardiac images to a Gaussian. This gave significantly improved

AAM matching results. The AAM’s L2 norm is theoretically optimal (in

a maximum-likelihood sense) for Gaussian residuals, so it makes sense to apply a

remapping to the sampled texture if this is more likely to result in Gaussian residu-

als. Another way of looking at it is that the AAM’s texture model can be a kind of

feature detector operating for example with some measure of edge strength or “corner-

ness”. Scott et al [123] developed a pair of orthogonal edge and corner measures

from the structure tensor in a similar manner to a Harris corner detector. These were

then renormalised onto a [0,1) scale using a similar sigmoidal function to that used

by Cootes [32]. An advantage of the renormalisation is that the value of the “feature

present” indication is upper-bounded at 1. This tends to avoid very large outliers

that can wreak havoc with least squares estimators. The other advantage of using

local image structure measures is that they are relatively invariant to imaging parameters.

Figure 4.11: Face corner feature appearance model - mode 1 variation

Furthermore because the structure tensor used in the measures of [123] involves

smoothing in a region around each pixel, they involve more information from outside

the boundaries of the current shape. When this is also placed in a multi-resolution

search, the effective convergence radius can be significantly increased. Figure 4.11

shows the variation in the first appearance mode’s “cornerness” feature component

in an appearance model of a dataset of faces, using Scott’s feature appearance model

[123]. The eyes, nostrils, and mouth are clearly visible as corner features.

4.5.5.2 AAM variants

The AAM has been widely adopted, and many applications and modifications sug-

gested. Notable amongst these are the Shape-AAM [25] and the Inverse-Compositional

AAM [3]. In the Shape-AAM [25] the residuals drive only the shape and pose param-

eters, and the texture parameters are then set by directly fitting the texture model

to the texture sampled at the current shape. Overall combined appearance model

constraints can then be imposed. In the Inverse-Compositional AAM [3] of Baker

and Matthews, it is demonstrated that the shape update should be implemented as

a function composition rather than a simple linear addition. However this approach

requires the shape and texture parameters to be updated separately, so potential

advantages of modelling their correlation are lost.

A potential problem for AAMs is variations in R across the population. The estimate

of the Jacobian from the training set will only be an approximation for any given

target image, and may be a poor one if the target image is significantly different

from the training images, or in the margins of the appearance distribution. Batur

and Hayes [4] have proposed Adaptive AAMs, in which the Jacobian varies as a

linear function of the position in parameter space. This can lead to more robust

and accurate convergence, particularly when dealing with examples where there is

significant texture variation (such as faces under differing lighting conditions). Cootes

introduced the Updating AAM [35] to address this problem. In the updating AAM

the update matrix is initially the same as the standard AAM, but is progressively

corrected as the search proceeds by continually re-estimating the Jacobian by utilising

the actual residuals occurring after the update.

4.5.6 Constrained AAM

Since the AAM is a local search method, and relies on an update matrix learned

in the locale of correct solutions, it relies upon a suitable initialisation. Typically

this initialisation is provided by prior estimates of some of the shape points, either

manually (e.g. a user clicks the mouse pointer on a small subset of points), or via

automatic feature detectors. There may be some prior knowledge of the variances

associated with these initialisation points. Cootes developed the Constrained AAM

[31] to incorporate such constraints.

The least squares minimisation of the standard AAM is replaced by a maximum a-

posteriori (MAP) formulation, which seeks to maximise the probability of the model

given the data which (by Bayes' theorem) is proportional to:

$$ P(\mathrm{data} \mid \mathrm{model})\,P(\mathrm{model}) \tag{4.28} $$

Given a uniform prior on the model parameters this is equivalent to a least squares

formulation under the assumption of uncorrelated Gaussian residuals of equal vari-

ance. A Gaussian prior could be assumed on the model parameters, and Cootes

showed how the AAM update step can be reformulated to incorporate this prior.

The effect is to pull the AAM solution more towards its mean, which may be helpful

in high noise conditions. In the rest of this thesis we will be more concerned with

incorporating prior knowledge about constrained points, so we summarise a simpli-

fied version of [31], ignoring the model prior terms (in effect a uniform model prior

is assumed).

Suppose we have prior estimates of the positions of some points in the image frame $\mathbf{X}_0$, together with their covariance matrix $\mathbf{S}_X$. Unknown points can be represented by zeroes, together with large upper bounds in $\mathbf{S}_X$, and effectively zeroes in $\mathbf{S}_X^{-1}$. Let $\mathbf{d}(\mathbf{p}) = (\mathbf{X} - \mathbf{X}_0)$ be a vector of the displacements of the current point positions from their prior positions. We assume further that the prior point positions are Gaussian distributed, and also that the texture residuals are independently and identically distributed with variance $\sigma_r^2$. Then maximising the logarithm of the MAP is equivalent to minimising:

$$ E_1(\mathbf{p}) = \sigma_r^{-2}\mathbf{r}^T\mathbf{r} + \mathbf{d}^T\mathbf{S}_X^{-1}\mathbf{d} \tag{4.29} $$

By using a first order Taylor expansion similar to that used to derive the basic AAM

update equation, the parameter update is given by the solution to the equation set:

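$$ \mathbf{A}\,\delta\mathbf{p} = -\mathbf{a} \tag{4.30} $$

where, after defining the Jacobian of $\mathbf{d}$ w.r.t. $\mathbf{p}$ as $\mathbf{K}$,

$$ \mathbf{A} = \left(\sigma_r^{-2}\mathbf{J}^T\mathbf{J} + \mathbf{K}^T\mathbf{S}_X^{-1}\mathbf{K}\right), \qquad \mathbf{a} = \left(\sigma_r^{-2}\mathbf{J}^T\mathbf{r}(\mathbf{p}) + \mathbf{K}^T\mathbf{S}_X^{-1}\mathbf{d}\right) \tag{4.31} $$

and

$$ \mathbf{K} = \frac{\partial\mathbf{d}}{\partial\mathbf{p}} \tag{4.32} $$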

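When computing the prior point displacement Jacobian $\mathbf{K}$, it is necessary to take into account the global pose transformation $\mathbf{t}$ as well as the appearance model parameters $\mathbf{c}$. Cootes further developed the special case of isotropic prior point positional variance with zero off-diagonal terms, and when the pose transformation $S_t(\mathbf{x})$ is a similarity transform which scales by $s$. Then let $\mathbf{x}_0$ be the prior point positions mapped into the model frame, so $\mathbf{x}_0 = S_t^{-1}(\mathbf{X}_0)$, and let $\mathbf{y} = s(\mathbf{x} - \mathbf{x}_0)$. Then $\mathbf{d}^T\mathbf{S}_X^{-1}\mathbf{d} = \mathbf{y}^T\mathbf{S}_X^{-1}\mathbf{y}$.

In this case:

$$ \mathbf{A} = \left(\sigma_r^{-2}\mathbf{J}^T\mathbf{J} + \mathbf{K}_m^T\mathbf{S}_X^{-1}\mathbf{K}_m\right), \qquad \mathbf{a} = \left(\sigma_r^{-2}\mathbf{J}^T\mathbf{r}(\mathbf{p}) + \mathbf{K}_m^T\mathbf{S}_X^{-1}\mathbf{y}\right) \tag{4.33} $$

The Jacobian $\mathbf{K}_m$ is the concatenation $\left(\frac{\partial\mathbf{y}}{\partial\mathbf{c}} \,\middle|\, \frac{\partial\mathbf{y}}{\partial\mathbf{t}}\right)$ and:

$$ \frac{\partial\mathbf{y}}{\partial\mathbf{c}} = s\mathbf{Q}_s, \qquad \frac{\partial\mathbf{y}}{\partial\mathbf{t}} = -s\,\frac{\partial\left(S_t^{-1}(\mathbf{X}_0)\right)}{\partial\mathbf{t}} - (\mathbf{x} - \mathbf{x}_0)\cdot\left(\frac{s_x}{s}, \frac{s_y}{s}, 0, 0\right) \tag{4.34} $$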

The update equation can then be solved using standard methods in linear algebra;

for example, since the matrix A is symmetric, Cholesky decomposition [108] can be

used for speed to invert A; but if that appears ill-conditioned, then SVD can be used

to robustly calculate an inverse (in the least-squares sense).
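
The solution strategy just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `constrained_update` and its argument names are hypothetical, the matrices are dense NumPy arrays, and a failed Cholesky factorisation is taken as the signal to fall back on an SVD-based least-squares solve.

```python
import numpy as np

def constrained_update(J, r, K, d, S_X_inv, sigma_r):
    """Solve A dp = -a (equations 4.30 and 4.31) for the update dp.

    J: texture Jacobian, r: texture residuals, K: displacement Jacobian,
    d: point displacements, S_X_inv: inverse prior covariance,
    sigma_r: texture residual standard deviation.
    """
    w = 1.0 / sigma_r ** 2
    A = w * (J.T @ J) + K.T @ S_X_inv @ K
    a = w * (J.T @ r) + K.T @ S_X_inv @ d
    try:
        # Fast path: A = L L^T, then two solves with the factors.
        # (np.linalg.solve does not exploit the triangular structure;
        # a dedicated triangular solver would be faster.)
        L = np.linalg.cholesky(A)
        z = np.linalg.solve(L, -a)
        return np.linalg.solve(L.T, z)
    except np.linalg.LinAlgError:
        # A is not numerically positive definite: use the SVD-based
        # minimum-norm least-squares solution instead.
        return np.linalg.lstsq(A, -a, rcond=None)[0]
```

Note that a matrix can be badly conditioned and still pass the Cholesky factorisation, so a practical version might also monitor the size of the diagonal of L before accepting the fast path.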

4.6 Model Optimisation using Minimum Descrip-

tion Length

One potential problem with statistical shape models is the need to establish the

set of correspondences between images in the training set. If points are placed at

positions that are in some sense inconsistent, then the shape model is degraded,

and may generate unphysical shapes. This can be particularly problematical when

not enough points are used. The correspondence problem is particularly acute when

dealing with 3D shapes with rather amorphous regions. An approach which shows

merit in partially automating the model-building process was introduced by Davies

et al [39, 40]. This is an information-theoretic approach based on the

minimum description length (MDL) of the model and training set. The idea is that

a consistent set of correspondences will lead to a more compact description and a

better model. MDL reformulates the correspondence problem as a method of finding

the most compact coding of the training set. This requires the minimisation of the

sum of the description lengths of: the model, each shape’s model parameters, and

each shape’s residuals. An existing shape segmentation is still required, but the set of

corresponding points is obtained by a re-parameterisation of the shape using a set of

kernel functions. The method then finds the set of kernel parameters which minimises

the total description length. Davies et al [39, 40] compared MDL-optimised shape

models to manually crafted ones and found that the MDL-optimised models had

better specificity (synthesised shapes more closely match at least one shape in the

training set), and generalisability (lower residuals in leave-one-out tests).

The original formulation of the MDL approach required an existing segmentation to

reparameterise. More recently Cootes et al [29] have experimented with a completely

automatic model building process by generalising MDL to building appearance mod-

els within a groupwise registration scheme. Shape and texture models are computed

using a current set of corresponding points for all but a current target image. This

current target image then has its points moved in order to minimise the description



length of encoding it using the current model. This process is repeated for each image

in turn, within a larger iteration loop, until (it is hoped) the process converges to a

set of co-registered images, with a set of points in correspondence. The method has

given promising results on a face dataset, and on a dataset of MR images of normal

brains. Current evaluation criteria have focussed on specificity measures, partly due

to a lack of ground truth.

4.7 Conclusions

Methods purely driven by image data are often unreliable as they are too under-

constrained, especially in medical images where noise and clutter are ubiquitous.

Difficult image understanding and segmentation problems require strong priors. Us-

ing linear models with orthogonal bases for these priors has many advantages, due

to their mathematical simplicity and the associated least squares methods, and their

relative ease of optimisation. The ASM and its descendants (particularly the Active

Appearance Model) form the most suitable class for many problems, as not only is

the shape model based on a training set, but there is also a texture model used to

match to an unseen image, which is also learnt rather than imposed. Because the

AAM learns so much from its training set it is relatively free of ad-hoc parameters.

Although the shape and texture models of the ASM and AAM provide helpful con-

straints that avoid unphysical solutions, and are robust against noise and partially

missing structure, there still remains the problem that they will fail to adequately fit

unseen objects which deviate too far from the training set. Thus model-based meth-

ods can suffer from an undertraining problem. In medical images, the application

may well be related to diagnosing disease. Unfortunately it is precisely the (rare?)

pathological cases in which the shape model may be undertrained. For example an

ASM can be used to fit to vertebrae in images of the spine, but the ASM would not

be able to fit to a severe vertebral fracture or an extreme scoliosis if none such were

present in the training set.

In such cases the greater flexibility of a “snake” could be advantageous, perhaps as

a user-invoked fallback mode. Also annotating the images to create the training set

can be very time consuming. This can be “bootstrapped”, by building initial models

which are then used to provide a degree of fitting to further training images, with the



model being continuously updated as more training images are added. In the early

stages of annotating the training images a snake approach could be used, steered by

an expert user. Thus the snakes could assist in the generation of a better trained

model. Further work on MDL-based automatic model building [29] may eventually

allow automation of the tedious process of hand-annotation of the training set.

In the subsequent chapter we also introduce a method of using multiple sub-models

which reduces problems with undertraining.


Chapter 5

The consistent combination of

multiple sub-model AAMs

5.1 Introduction

In this chapter we discuss our methods for semi-automatically segmenting vertebrae,

but presented in a generic form. Rather than using a single model of the whole

spine, we have developed methods for using a sequence of linked sub-models. We

published an early paper [113] on a basic version of these ideas immediately prior

to commencing this PhD. This work is also substantially summarised as part of this

chapter, for reasons of logical completeness.

We begin by discussing a general trade-off concerning sub-model flexibility and con-

straint. We define a general sequencing algorithm for linking multiple sub-models,

illustrated specifically by the vertebral segmentation problem. We compare different

methods of sequencing the fitting of the several sub-models.

We have already published most of the ideas in this chapter in a number of papers

[113, 114, 116, 117]. However as the work has been spread over four different papers, it

is necessary to rewrite the content into a more integrated whole. Whereas the earlier

implementation [113] prior to this thesis assumed zero off-diagonal covariance terms,

in this thesis we examine the use of bootstrapped estimates of inter-point covariance,

which could allow better modelling of the links between the multiple sub-models.


Note that detailed results on vertebral segmentation accuracy with a variety of AAM

forms and sub-model structures are presented in the next chapter. Here we deal with

the more generic algorithm for combining sub-models together in a linked sequence.

5.2 Model-Based Segmentation - Some Trade-Offs

5.2.1 Statistical Models in Medical Imaging

Many problems in medical image interpretation require an automated system to anal-

yse images. These images may provide noisy data, and typically complex structure.

As noted in the previous chapter, model-based methods offer solutions to these diffi-

culties [33], by enforcing strong priors learned from a set of annotated training im-

ages. A widespread such approach is the Active Appearance Model (AAM) [33, 24],

discussed in detail in section 4.5.

5.2.2 Global vs Local Models

Although the use of a model provides helpful constraints that avoid unphysical solu-

tions, there still remains the problem that as the model is based on a finite training

set, it will fail to adequately fit unseen objects which deviate too far from the training

set. In particular, under-training of the model may mean that it is insufficiently

adaptable on a local level, especially when pathologies are present. Sometimes the

training set might reasonably capture the variation in some sub-structures, but not all

the necessary variation in their inter-relationships. For example large kyphosis (high

spinal curvature) or scoliosis (pathological lateral displacement of vertebrae) may not

be captured in a global model. Or, since one vertebral fracture means that others are

more likely, in a small training set some spurious correlations may be learned, so that

for example a severe fracture of T12 might only be reproducible in a global model

with a severe fracture of T10. In previous work [113] we showed that such under-

training problems could at least be mitigated by using multiple sub-models, partly

because the use of sub-models means that each sub-structure can have its own pose

parameters. Moreover when there are projective effects present (e.g. from a divergent

X-ray beam) each sub-model can have a locally optimised affine transform to best

approximate to the varying projection. Another advantage of using sub-structures



is that the order of their fitting can be dynamically adjusted to defer problematical

regions. Problems can arise from either clutter (e.g. soft-tissue), high noise, par-

tial occlusion, or from pathologies. We show later how this dynamic sequencing can

be done using a quality of fit measure; in the next chapter we show that this dy-

namic ordering improves accuracy on DXA images of the spine. Also when there is

varying brightness, contrast, or other variable lighting effects across the image, the

use of smaller structures may mitigate such problems by allowing local normalisa-

tion. Finally inherent non-linearities, or higher moments (e.g. skewness) of the shape

distribution, are even harder to capture in a single global model.

But if the size of modelled sub-structures is too small, then in noisy data the models

can be too unconstrained. So there is a trade-off in the optimal size of structures to

model. Even when models of sub-structures are used, it is still necessary to express

the global level relationships between them. We show that if the constrained form of

the AAM is used [31], this linkage can be naturally expressed in terms of soft point

constraints, which are recursively updated as each sub-model is fitted. It is possible

to overlap the sub-structures, thus providing natural linkage. Also (or instead) a

global shape model of all the points can be used to constrain the sub-models.

5.3 Combining Overlapping Sub-Models

We summarise how the linked sub-model search algorithm for locating vertebrae

operated in our earlier work [113]. Then we describe the generalisation and the

enhancements we have introduced in the course of this work.

5.3.1 Vertebral Triplet Modelling

5.3.1.1 Sub-Model Iterative Linkage

In [113] the spine was modelled by a sequence of overlapping triplets of vertebrae, and

the sequence of sub-model solutions were combined as follows, using a fixed ordering

starting by going up the lumbar spine. Each vertebra is fitted using the triplet sub-model in

which it is central (Figure 5.1). When a triplet model has been fitted it also provides

part of the initialisation (via the overlapping vertebrae) for subsequent iterations



which fit its neighbours. Furthermore constraints are applied so that overlapping

vertebrae cannot be moved far from the provisional positions determined previously

- the constrained AAM [31] (equation 4.31) rather than simple AAM is used. Note

also that subsequent iterations do not update any point positions which have already

been determined, unless the point is in the central vertebra of the triplet.

This feed-forward of constraints is further aided by re-fitting the global shape model

to the solution so far. This is used to initialise a starting solution for vertebrae not

yet fitted, but only low constraint weights are attached to this global prior. Thus

information in the global shape model is still used in guiding the solution, but the

global shape constraints are downweighted, which allows their violation to a degree

if the image evidence locally supports such a solution.

Our approach differs somewhat from that of Davatzikos et al [38], who propose a

Hierarchical Active Shape Model based on a wavelet decomposition of the shape

contours, in which local regions of the wavelet transform space are decoupled when

applying the shape model constraints. In [38] coarse global constraints continue

to apply in a strict sense. However, in the case of segmenting vertebrae, certain

pathologies of the spine such as scoliosis may cause even coarse aspects of the shape

model to be violated, as whole vertebrae can be laterally shifted outside the region

captured in the training set. Therefore we use multiple overlapping AAMs instead,

which allows a wider range of pathologies to be adequately fitted. Also there may be

some advantages in decomposing the image search process, as well as the application

of shape constraints. Furthermore our approach is more suitable to AAM search, as

unlike with the ASM the shape constraints are not applied after first locating a set

of target positions, but are intrinsic to the search process per se.

5.3.2 Generalisation

This idea of linking sub-models can be generalised. It is simply necessary to label

points in a sub-model as either core (e.g. the central vertebra of the triplet), or overlap

(i.e. included to provide linkage with other sub-models). It is also necessary to store

the overall solution in for example a global vector of points. Each sub-model also has

its own local copy of its sub-solution only. The difference between core and overlap

points is that the core points are always copied back into the global solution when

the sub-model is fitted (even if they had been provisionally set as part of a previous



Figure 5.1: An illustration of the first two iterations (static fit ordering) combining vertebral triplet sub-models. a) shows the result of fitting the first triplet containing L4/L3/L2. b) indicates the next iteration which fits L3/L2/L1. The second iteration resets L2 (now core) and provides an initialisation of L1, and an updated prediction for T12 via the global shape model. However the previous solution for L3 is retained (not core).

sub-model); whereas the overlap points are copied out only once, the first time they

are fitted. As overlap points can be contained in more than one model this distinction

is made in order not to compromise overall consistency. The second difference is that

once core points are fitted they are assigned higher constraint weights. In principle

the overlap can be null, in which case the updated predictions made through the

global shape model provide the only linkage.
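
The copy-back rule for core and overlap points can be sketched as follows. This is a simplified illustration only: the function name `impose_points` is hypothetical, each point is reduced to a single value (real points carry two coordinates), and the associated covariance update is omitted.

```python
def impose_points(global_pts, determined, local_pts, indices, core):
    """Copy a fitted sub-model's points back into the global solution.

    indices[k] is the global index of local point k; core is the set of
    global indices that are core points of this sub-model.
    """
    for k, j in enumerate(indices):
        # Core points always overwrite; overlap points are written only
        # if no earlier sub-model has determined them.
        if j in core or not determined[j]:
            global_pts[j] = local_pts[k]
        determined[j] = True


# Two overlapping triplets, mirroring the two iterations of Figure 5.1.
pts = [0.0] * 5
done = [False] * 5
impose_points(pts, done, [10.0, 11.0, 12.0], [0, 1, 2], core={1})
impose_points(pts, done, [21.0, 22.0, 23.0], [1, 2, 3], core={2})
```

In the second call the point with global index 1 is an already determined overlap point and so keeps its earlier value, index 2 is now core and is reset, and index 3 is written for the first time.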

5.3.3 Dynamic Sub-Model Sequence Ordering Algorithm

If one sub-model fit fails, then this can misalign the starting solution for subsequent

iterations. In such a case it would be better to adapt the sequence dynamically by

comparing the fit quality of several candidate sub-models. Picking the best quality

fit model as the one to impose at this iteration will tend to defer noisier or poorly

fitting regions until they have been better constrained by their neighbours. Therefore

we developed the originally static ordering of [113], and use a dynamic ordering,

which is also easier to generalise. This extension to the algorithm has been published

previously in [114]. A set of Nc candidate sub-models is maintained. Each candidate



is provisionally fitted, then the sub-model with the best fit quality is imposed into the

global solution. We use the residual sum of squares as the basis of the fitting quality

measure, but with some renormalisation as discussed subsequently. A new candidate

is then added as detailed in the next paragraph, until all sub-models have either been

fitted already, or are now in the set of candidates. When no more candidates can be

added the remaining Nc − 1 sub-models are fitted (best quality first) and the search

concludes.

A new candidate is added at the end of each iteration by searching from the latest

best candidate to locate its nearest neighbour that has never been added to the

candidate list. In the vertebral case the ordering uses the current estimates of the

distance between vertebral centres. In more general applications it would be necessary

to define a kind of proximity ordering on the sub-models in order to know which

candidate to push into the list next. This ordering can be viewed as a set of adjacency

relations, which may not necessarily be transitive overall. For example rather than

have a single overall ordering of the model set, each sub-model can have its own

ordering of all the others, based on either mean proximity, or mutual correlation.
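
The dynamic sequencing described above can be sketched as a short loop. The names `dynamic_order`, `fit_quality` and `proximity` are hypothetical stand-ins: `fit_quality(i)` represents a provisional constrained AAM fit of sub-model i (higher is better) and `proximity(k, j)` represents the distance from sub-model k to sub-model j. In the real algorithm the fit quality changes as the constraints are updated; here it is a fixed function purely for illustration.

```python
def dynamic_order(n_models, start, fit_quality, proximity):
    """Return the order in which sub-models are imposed.

    start: indices of the initial candidate sub-models.
    """
    candidates = set(start)
    remaining = set(range(n_models)) - candidates
    order = []
    while candidates:
        # Provisionally fit every candidate and impose the best one.
        quality = {i: fit_quality(i) for i in candidates}
        best = max(quality, key=quality.get)
        order.append(best)
        candidates.remove(best)
        # Add the nearest neighbour of the best model that has never
        # been a candidate.
        if remaining:
            nxt = min(remaining, key=lambda j: proximity(best, j))
            candidates.add(nxt)
            remaining.remove(nxt)
    return order


# Five sub-models in a line; sub-model 2 fits poorly and is deferred.
q = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.7, 4: 0.6}
order = dynamic_order(5, [0, 1], q.get, lambda k, j: abs(k - j))
```

With these toy values the poorly fitting sub-model 2 is imposed last, after both of its neighbours have been fitted.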

5.3.4 Algorithm Pseudo-Code

In this section we summarise the sub-model combination algorithm more formally in

a pseudo-code form.

We make use of the following nomenclature:

Let C be the working set of current candidate AAMs, and let R be the set of indices

of remaining sub-models not yet fitted. Let x(g) be the current global solution of all

points. Let S(g) denote the global covariance matrix of the errors in x(g).

Let M be the global shape model. Let $\hat{x}^{(g)}$ be a vector containing the global shape

model’s weighted best fit to $x^{(g)}$, with weights determined by the respective (reciprocal)

variances. Note that points in $\hat{x}^{(g)}$ that have not yet been determined become in

effect predictions from the points that have been determined, because the determined

points have substantially higher weights attached. Let d be a vector of boolean flags

indicating which points have been determined by a sub-model fit. Let 〈mi|mj〉 denote

the proximity measure of AAM sub-model i to sub-model j. At each iteration, it is

necessary to copy between the partial global solution and relevant sub-model local



storage. We let $P_i(x^{(g)}, \hat{x}^{(g)}, d, f)$ denote the operation of copying from the global

solution and prediction vectors into a local points vector for sub-model i, where f is

a boolean flag set true on the first iteration, and false thereafter. Also let $T_i(S^{(g)})$

denote a similar operation to extract relevant portions of the covariance matrix.

We also adopt the following conventions.

1. Let M→fit(x,w) be the weighted least squares fit of model M to points vector

x using corresponding weights w. See Appendix A.

2. Let m→x denote the current points solution of model m

3. diag(A) denotes the vector formed from the diagonal elements of matrix A.

4. Given a vector w we use $w^{[-1]}$ to denote the vector whose elements are the

reciprocals of those of w, i.e. $[w^{[-1]}]_i = ([w]_i)^{-1}$.

5. We use the C-style ? operator with boolean flags, so (b ? x1 : x2) means the

latter expression evaluates to x1 if boolean variable b is true, or x2 otherwise.

Firstly it is necessary to somehow initialise the global solution points vector x(g),

and set an initial covariance matrix $S^{(g)}$ to reflect the variance of this process,

together with a corresponding correlation matrix $T^{(g)}$. For example some limited

user interaction may fix an approximate initial solution, or a salient feature detection

process might produce an initial estimate. We use a parameter $\sigma_I^2$ which is the mean

point error variance associated with the initialisation, i.e. $\sigma_I^2 = \mathrm{Tr}(S^{(g)}) / 2n$. After the

first iteration (i.e. once a sub-model has been fitted), the relevant remaining

terms have a reduced variance $\sigma_P^2$ (see section 5.3.5.3). The algorithm then proceeds

as defined in Algorithm 1.
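
As a small sketch of two quantities used by the algorithm, the following computes the mean initialisation variance and the weights used when refitting the global shape model, where determined points keep their reciprocal variances and all others receive the reciprocal initialisation variance. The function names are hypothetical, and it is assumed here that n is the number of points, so that the global covariance matrix has 2n rows, and that the determined flags are held per coordinate.

```python
import numpy as np

def mean_point_variance(S_init):
    """sigma_I^2 = Tr(S) / 2n, with 2n taken as the number of rows of S."""
    return np.trace(S_init) / S_init.shape[0]

def prediction_weights(S_g, determined, sigma_I2):
    """w'_j = (d_j ? w_j : sigma_I^-2), where w = diag(S_g)^[-1]."""
    w = 1.0 / np.diag(S_g)
    return np.where(determined, w, 1.0 / sigma_I2)


S_init = np.diag([4.0, 4.0, 1.0, 1.0])
sigma_I2 = mean_point_variance(S_init)
S_now = np.diag([0.25, 0.25, 9.0, 9.0])
w_prime = prediction_weights(S_now, np.array([True, True, False, False]), sigma_I2)
```

The determined coordinates therefore dominate the weighted fit, and the remaining coordinates act largely as predictions.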

5.3.5 Updating the Constraint Variance

5.3.5.1 Simple diagonal constraint covariance matrix

When using the constrained AAM, the optimum solution depends on the covariance

matrix SX . In our earlier work [113] we assumed a simplified diagonal form with

essentially just three variances: low, medium and high values. These corresponded



Algorithm 1 Formal Sub-Model Combination Algorithm

1. Insert the initial $N_c$ candidate sub-models into C and initialise $R = \{i : m_i \notin C\}$.

2. Set “first” iteration flag f = true and $[d]_j$ = false, $\forall j$

3. While $C \neq \emptyset$ loop

   (a) Initialise sub-model quality container, $Q = \emptyset$

   (b) For each sub-model $m_i \in C$ do

       i. Copy current global solution and covariance matrix partitions to local constraints $x'_i$, $S_X^{(i)}$, thus (see Algorithm 2):

          A. $(x^{(g)}, \hat{x}^{(g)}) \overset{P_i(d,f)}{\longmapsto} x'_i$

          B. $S^{(g)} \overset{T_i}{\longmapsto} S_X^{(i)}$

       ii. Set weights $w = \mathrm{diag}(S_X^{(i)})^{[-1]}$

       iii. Initialise $m_i$ APM parameters to the best weighted fit of $m_i$ to $x'_i$ given weights w (see Appendix A).

       iv. Perform a constrained AAM fit of $m_i$ to the image, given point constraint values $x'_i$ and covariance matrix $S_X^{(i)}$.

       v. Calculate the quality of fit of this sub-model q (see section 5.3.6), and store in a suitable ordered container Q so that Q(i) = q.

   (c) if f then reinitialise $S^{(g)}$: $S^{(g)} = \sigma_P^2 T^{(g)}$

   (d) Obtain index of best-fitting model, $k = \arg\max(Q)$

   (e) Impose sub-model $m_k$ into global solution, thus updating $x^{(g)}$, $S^{(g)}$, d. See Algorithm 3.

   (f) Update predicted points thus:

       i. Set weights $w = \mathrm{diag}(S^{(g)})^{[-1]}$; then $w'$ is set thus: $[w']_j = (d_j \;?\; w_j : \sigma_I^{-2}) \;\; \forall j$

       ii. $\hat{x}^{(g)} = M{\to}\mathrm{fit}(x^{(g)}, w')$

       iii. Adjust the global covariance matrix for updated predictions if necessary - see section 5.3.5.4.

   (g) $C \mapsto C \setminus \{m_k\}$

   (h) if $R \neq \emptyset$

       i. $k' = \arg\min_{j \in R} (\langle m_k | m_j \rangle)$

       ii. $C \mapsto C \cup \{m_{k'}\}$, $R \mapsto R \setminus \{k'\}$

   (i) f = false



Algorithm 2 Project points into sub-model, operations $P_i(x^{(g)}, \hat{x}^{(g)}, d, f)$; $T_i(S^{(g)})$

Let $I_i$ denote the ordered set of point indices of sub-model $m_i$. Note we assume that the points have the same relative ordering in the sub-model as the global model, with the first sub-model point having local index 0 and global index $[I_i]_0$ (i.e. as in C-style array indexing). The operation sets an output vector $x_i$ and output covariance matrix $S_X^{(i)}$.

1. k = 0

2. For $j \in I_i$ do

   (a) Perform $P_i(x^{(g)}, \hat{x}^{(g)}, d, f)$ for the current point thus:

       i. if $[d]_j$ = true or f = true then $[x_i]_k = [x^{(g)}]_j$; otherwise $[x_i]_k = [\hat{x}^{(g)}]_j$

   (b) Perform $T_i(S^{(g)})$ for the row corresponding to the current point thus:

       i. $k' = 0$

       ii. For $j' \in I_i$ do

           A. $[S_X^{(i)}]_{kk'} = [S^{(g)}]_{jj'}$

           B. $k' \mapsto k' + 1$

   (c) $k \mapsto k + 1$
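
The two projection operations of Algorithm 2 can be sketched compactly with array indexing. This is an illustrative sketch only: the function name `project_into_submodel` is hypothetical, and the global vectors are assumed to be NumPy arrays with one entry per coordinate, with the determined flags held as a boolean array of the same length.

```python
import numpy as np

def project_into_submodel(x_g, x_pred, d, first, S_g, idx):
    """Extract local constraint points and covariance for one sub-model.

    x_g: global solution, x_pred: global shape model prediction,
    d: boolean 'determined' flags, first: True on the first iteration,
    S_g: global covariance matrix, idx: ordered global indices.
    """
    idx = np.asarray(idx)
    # Determined points (or all points on the first iteration) come from
    # the global solution; the rest come from the prediction vector.
    use_solution = d[idx] | first
    x_local = np.where(use_solution, x_g[idx], x_pred[idx])
    # Rows and columns of the covariance matrix for these indices.
    S_local = S_g[np.ix_(idx, idx)]
    return x_local, S_local


x_g = np.array([1.0, 2.0, 3.0, 4.0])
x_pred = np.array([10.0, 20.0, 30.0, 40.0])
d = np.array([True, False, True, False])
S_g = np.arange(16.0).reshape(4, 4)
x_loc, S_loc = project_into_submodel(x_g, x_pred, d, False, S_g, [1, 2])
```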



Algorithm 3 Impose sub-model into global solution

Let $I_i$ denote the ordered set of point indices of sub-model $m_i$. Note we assume that the points have the same relative ordering in the sub-model as the global model, with the first sub-model point having local index 0 and global index $[I_i]_0$ (i.e. as in C-style array indexing). Let $I_C$ denote the ordered subset of $I_i$ containing the indices of points in the core of this sub-model. Let $S_X^{(i)}$ denote the estimated sub-model covariance matrix (after fitting this sub-model), as given by equation 5.7. Let $T^{(g)}$ denote the global correlation matrix obtained by renormalising the initial values of $S^{(g)}$ (see section 5.3.5.3). Note that as discussed in section 5.3.5.3 we use a renormalisation to a constant mean prediction variance $\sigma_P^2$.

1. $x = m_i{\to}x$

2. k = 0

3. For $j \in I_i$ do

   (a) if $j \in I_C$ or $d_j$ = false

       i. $[x^{(g)}]_j = x_k$

       ii. $k' = 0$

       iii. For $j' \in I_i$ do

            A. if $j' \in I_C$ or $d_{j'}$ = false then $[S^{(g)}]_{jj'} = [S_X^{(i)}]_{kk'}$; otherwise $[S^{(g)}]_{jj'} = [S^{(g)}]_{j'j} = 0$

            B. $k' \mapsto k' + 1$

       iv. Renormalise the cross-covariance terms in the global covariance matrix between this updated value and all other predicted points. See section 5.3.5.3. Note that these steps could be replaced by a full update of the prediction covariance matrix, see section 5.3.5.4. In detail:

       v. Predicted point indices set $J_P = \{j' : j' \notin I_i \text{ and } d_{j'} = \text{false}\}$

       vi. For $j' \in J_P$ do

           A. $[S^{(g)}]_{jj'} = \sigma_P \sqrt{[S^{(g)}]_{jj}} \, [T^{(g)}]_{jj'}$

           B. $[S^{(g)}]_{j'j} = [S^{(g)}]_{jj'}$

   (b) $d_j$ = true

   (c) $k \mapsto k + 1$



respectively to: determined core points; overlap points which had been fitted but

were not yet core; and predicted points (i.e. predicted via the global model from

either the initialisation, or consequent to previous sub-model fits). See for example

Figure 5.1. The constraint variances for points in the core portion of the sub-model

just fitted are reset to reflect the estimated variance of the point location error $\sigma_a^2$∗;

whereas the associated variance for those points in its overlap subset is set to a

higher value $\sigma_O^2$, which effectively downweights them, allowing greater flexibility to

neighbouring sub-models which will subsequently refit them. Other points which

have not yet been fitted (and are merely predicted via the global model) had a high

variance. This was set on the first iteration to the average variance (over all points)

of the initialisation (point-to-line) error $\sigma_I^2$. Subsequently the prediction variance

was set using a constant $\sigma_{P0}^2$ obtained by averaging over all points in the spine. This

was obtained by taking the initialisation points as given, and then fixing the points

in just two vertebrae in turn to their manually annotated values, and then obtaining

the best shape model fit to this point subset. The (point-to-line) prediction errors for

points in the next neighbour can then be obtained, and an average variance calculated

over the spine by gradually moving up fixing just one pair of vertebrae in turn. So

for example we fix the values of points in L4 and L3, and use the global model to

predict L2; next we treat L3 and L2 as fixed and predict where L1 is; and so on. In

the general case the relevant subset of points to average the predictions over at each

stage are those in the nearest other sub-model. The actual prediction variance used

in the algorithm is then further degraded by adding on the variance assumed for the

determined points, so finally:

$$\sigma_P^2 = \sigma_a^2 + \sigma_{P0}^2 \qquad (5.1)$$

In reality the set of points that have been already determined may well contain more

than two vertebrae, especially when the dynamic sequencing method is used, so this

method represents a slightly crude upper bound. A possible more exact calculation

is presented later, but we have not implemented it.

Somewhat arbitrarily the overlap point variance $\sigma_O^2$ was set in [113] by averaging the

∗ as a point-to-line error, which then produces an isotropic variance of $\sigma_a^2$ in both Cartesian components



prediction variance and the core accuracy so

$$\sigma_O^2 = 0.5(\sigma_a^2 + \sigma_P^2) \qquad (5.2)$$

The value of $\sigma_a$ can be initially set by determining the mean accuracy of a standard

global AAM, and if necessary recursively fine-tuned in subsequent experiments with

sub-models.
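
The three constraint variances of the simple diagonal scheme follow directly from equations 5.1 and 5.2, as the following sketch shows. The function name `constraint_variances` and the example values are hypothetical; the inputs are the standard deviations $\sigma_a$ and $\sigma_{P0}$.

```python
def constraint_variances(sigma_a, sigma_p0):
    """Return (core, overlap, predicted) constraint variances."""
    var_core = sigma_a ** 2
    var_pred = var_core + sigma_p0 ** 2          # equation 5.1
    var_overlap = 0.5 * (var_core + var_pred)    # equation 5.2
    return var_core, var_overlap, var_pred


# Arbitrary illustrative values, not measured accuracies.
low, medium, high = constraint_variances(sigma_a=1.0, sigma_p0=2.0)
```

By construction the overlap variance always lies midway between the core and prediction variances, which gives the low, medium and high ordering described above.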

5.3.5.2 Non-diagonal constraint covariance matrix

In reality the simple diagonal assumption will fail, as there are bound to be correla-

tions between the errors in nearby points. In this section we discuss how a plausible

(though approximate) covariance matrix can be introduced. The true values of SX

are difficult to determine, and indeed are somewhat circular, since they depend on

previous stages of the fitting algorithm, which in turn depend on the values assumed

for SX . Also we wish to include a degree of “model noise” to reflect inadequacies of

the training set, which by definition is something of an unknown. However we can say

that we expect correlations in the errors to be related to the fundamental correlation

between the points themselves, as captured in the shape model. Clearly there will

be correlation between the errors in estimates of nearby points, as fitting errors will

tend to displace several adjacent points off their true location in a connected manner.

We recall that the AAM will be working in appearance parameter space, and so it is

natural to explore the effects on point errors of small errors in the model parameters.

These are not simple linear transforms due to “model noise”, so we estimate their

effects via “leave-M-out” bootstrap experiments.

Secondly the purpose of having the “medium” constraints attached to sub-model

overlap points is to provide a degree of linkage, but in a way which does not over-

constrain subsequent sub-model searches for which these points will be core. Part of

the purpose of this is to recognise that partial search failures do occur, and to have

a kind of “second-chance” when fitting subsequent sub-models. We therefore view

the medium constraint weights that should be attached to these overlap points as

intermediate between the underlying fit accuracy (on successful searches), and that

which would be obtained if we simply used the sub-model’s shape model to predict

the overlap points given the core. Thus for the overlap variance we use on point i in



a sub-model:

$$\sigma_{iO}^2 = 0.5(\sigma_a^2 + \sigma_{ip}^2) \qquad (5.3)$$

where $\sigma_{ip}^2$ is the prediction variance of overlap point i given the core solution. The

prediction covariance terms are derived in “leave-M-out” bootstrap experiments as

follows.

We randomly select a test subset (M=16), and train shape models on the remainder

of the training set. We then loop over the test set. For each test image we take its

annotated shape and select the shape model parameters to best fit the core points

only. In essence the covariance of the prediction errors in the overlap points can then

be calculated from the differences between the annotated positions and their values in

the (core-fitted) shape model. We also wish to have an a priori plausible covariance

matrix for the fitting errors on the core, so we make small random perturbations to the

shape model parameters, which allows an overall covariance matrix to be established.

The scale of the perturbations is selected so that the average expected point variance

induced over the core points equals $\sigma_a^2$, though in reality it may be slightly higher

due to model inadequacy.

The perturbation in parameter k in test image ν is a zero-mean Gaussian with standard

deviation $\Delta_\nu \sigma_k$, where $\sigma_k$ is the shape parameter’s standard deviation (derived from

the mode’s associated eigenvalue), and $\Delta_\nu$ is a scaling factor. Recalling the linear

nature of the shape model, the variance induced in Cartesian co-ordinate xi will be:

$$\sigma_i^2 = \Delta_\nu^2 \sum_{k=1}^{m_s} [P_s]_{ik}^2 \sigma_k^2 \qquad (5.4)$$

Let $\bar{\sigma}_i^2 = \sum_{k=1}^{m_s} [P_s]_{ik}^2 \sigma_k^2$. Then $\Delta_\nu$ is determined by averaging the set of $\bar{\sigma}_i^2$ over

the coordinates of core points and requiring an average variance of $\sigma_a^2$. Let the sets

$I_C$ and $I_O$ define the indices of points in the sub-model which are core or overlap

respectively. Then

$$\Delta_\nu^2 = \frac{|I_C| \, \sigma_a^2}{s_\nu^2 \sum_{i \in I_C} \bar{\sigma}_i^2} \qquad (5.5)$$

where sν is the scale of the model to world co-ordinates similarity transform for test

image ν.
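
The choice of perturbation scale can be sketched as follows. This is an illustration under assumptions rather than the code used in the experiments: the function name `perturbation_scale` is hypothetical, `P_s` is taken to be the matrix of shape modes with one row per coordinate, and `core_idx` lists the rows belonging to core points.

```python
import numpy as np

def perturbation_scale(P_s, sigma_k, core_idx, sigma_a, s_nu):
    """Return the scaling factor Delta for one test image (equation 5.5)."""
    # Variance induced in each coordinate by unit-scale perturbations.
    unit_var = (P_s ** 2) @ (sigma_k ** 2)
    core_sum = unit_var[core_idx].sum()
    delta_sq = len(core_idx) * sigma_a ** 2 / (s_nu ** 2 * core_sum)
    return np.sqrt(delta_sq)


P_s = np.eye(2)
sigma_k = np.array([2.0, 3.0])
delta = perturbation_scale(P_s, sigma_k, [0, 1], sigma_a=1.0, s_nu=1.0)
# Mean induced variance over the core coordinates, in world co-ordinates.
mean_var = delta ** 2 * np.mean((P_s ** 2) @ (sigma_k ** 2))
```

With this scaling the mean induced variance over the core coordinates equals $\sigma_a^2$, which is the requirement stated above.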

The covariance matrix C thus derived is then renormalised into a correlation matrix



T to allow a more flexible change of variance scaling. After fitting a sub-model we

then fix variances of core points to $\sigma_a^2$, and variances of overlap points to $\sigma_{iO}^2$. Note

that when deriving the latter via equation 5.3 we set

$$\sigma_{ip}^2 = C_{ii} \qquad (5.6)$$

Thus after a sub-model has been fitted, (and selected as the best fitting one of the

candidate set), and so associated point constraints need updating, the point covari-

ance used on its updated points in a subsequent constrained AAM (see equation 4.31)

is given by the appropriate term in:

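$$[S_X]_{ij} = T_{ij} \sigma_a^2 \qquad i, j \in I_C$$

$$[S_X]_{ij} = T_{ij} \sigma_{iO} \sigma_{jO} \qquad i, j \in I_O$$

$$[S_X]_{ij} = T_{ij} \sigma_{iO} \sigma_a \qquad i \in I_O, \; j \in I_C$$

$$[S_X]_{ij} = [S_X]_{ji} \qquad i \in I_C, \; j \in I_O \qquad (5.7)$$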

But if previous sub-model fits had already determined some of the overlap points, then

since these points will not be updated again by this sub-model, the diagonal (variance)

overlap terms are not updated, and the covariance terms between such overlap and

core points are instead reset to zero. This reflects the fact that the respective core

and overlap points have been set by notionally independent estimators.
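
Equation 5.7 amounts to scaling the correlation matrix by a per-point standard deviation, which can be sketched as below. The function name `submodel_covariance` is hypothetical, and the special case of overlap points already determined by an earlier sub-model (zeroed cross terms, unchanged variances) is deliberately left out of this sketch.

```python
import numpy as np

def submodel_covariance(T, core_mask, sigma_a, sigma_overlap):
    """Build S_X from correlation matrix T as in equation 5.7.

    core_mask: True for core coordinates, False for overlap ones.
    sigma_overlap: per-coordinate overlap standard deviations (entries
    at core positions are ignored).
    """
    sd = np.where(core_mask, sigma_a, sigma_overlap)
    # [S_X]_ij = T_ij * sd_i * sd_j covers all four cases of 5.7.
    return T * np.outer(sd, sd)


T = np.array([[1.0, 0.5], [0.5, 1.0]])
S_X = submodel_covariance(T, np.array([True, False]), 1.0, np.array([0.0, 2.0]))
```

Because T is symmetric, the resulting matrix is symmetric as well, which is what the fourth line of equation 5.7 requires.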

5.3.5.3 Covariance matrix of predictions via global shape model

Finally for points which have merely been predicted but not yet updated we simply

continue to use the prediction error correlation matrix resulting from the initialisation

process. This again is derived in bootstrap resampling experiments (with separate

train/test splits). The initialisation process is application specific, and may need to

simulate user input such as clicking on a small subset of initialisation points with

some typical precision error, or running some other initial salient feature detection

process. It could be as simple as assuming the mean shape and pose. The difference

between the actual shape and the shape generated by the initialisation process is

recorded for each test image, and then a prediction error covariance matrix S(g)I can

be derived.

This matrix of course reflects the errors at initialisation. After the first iteration, in order to allow for a somewhat reduced prediction error as more points are determined, we renormalise this matrix by first converting it to a correlation matrix $T^{(g)}$, so

$$[T^{(g)}]_{ij} = \frac{[S_I^{(g)}]_{ij}}{\left([S_I^{(g)}]_{ii}\,[S_I^{(g)}]_{jj}\right)^{0.5}} \qquad (5.8)$$

Then we multiply by $\sigma_P^2$ (see equation 5.1 above). Thus the variance terms are constant, reflecting an approximate average as in the earlier implementation.
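The renormalisation of equation 5.8 followed by the rescaling can be sketched as follows. This is an illustration only: the function names are hypothetical and matrices are represented as plain lists of rows.

```python
import math


def to_correlation(S):
    """Equation 5.8: T_ij = S_ij / sqrt(S_ii * S_jj)."""
    n = len(S)
    return [[S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(n)]
            for i in range(n)]


def rescale(T, variance):
    """Multiply a correlation matrix by one common variance (sigma_P^2 in the text)."""
    return [[t * variance for t in row] for row in T]
```

After rescaling, every diagonal term equals the chosen variance, while the off-diagonal terms retain the correlation structure observed at initialisation.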

Thus in our current implementation, although the predicted points themselves do

get updated as the search proceeds, the associated covariance terms do not entirely

reflect the full set of available predictors. This is admittedly inconsistent, but in our

application, the main constraints arise through overlapping the sub-model triplets.

If the sub-model coupling methods we have developed were to be used in another

application where the sub-models had a lesser overlap, then the updated predictions

through the global model would have to play a stronger role. In such a case the con-

ditional global covariance matrix (given the points determined so far) should be used,

but including some additional “model noise”. However in the vertebral case there

is already a set of constraints imposed via the overlapping of the sub-models them-

selves, and it is simply not worth the computational effort to evaluate the conditional

covariance matrix for predicted points; especially when the associated constraints are

in any case quite weak. However in the next section we outline how this could be

done.

A slight complication is how to update the covariance terms between predicted points

and those which have just been determined by a model fit. For example if we fit the

vertebral triplet centred on T12, then the covariance terms relating T11, T12 and L1 are

updated. This has a knock-on effect on all other cross-terms between all these points

and all other points which are currently predicted. The update arising from the latest

sub-model imposition means that the cross-covariance terms between for example

T10 and T11 are in effect “broken” - as these are still set to values appropriate to

the initialisation, whereas the variance of the T11 estimates has just been reduced

by imposing the T12 solution. We therefore update these cross-terms assuming the

same prediction correlation coefficients as at initialisation, but we renormalise for the

updated T11 point variances. This ensures that the sub-partitions used in sub-model

fits remain consistent.
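Written out, and assuming that $T^{(g)}$ denotes the initialisation correlation matrix and $[S]_{ii}$ the current variance of co-ordinate $i$ (reduced for points just fitted, unchanged for points that are still only predicted), the renormalisation described above amounts to the following. This is a restatement of the paragraph in symbols, not an additional equation taken from the implementation.

```latex
[S]_{ij} \leftarrow [T^{(g)}]_{ij}\,\sqrt{[S]_{ii}\,[S]_{jj}},
\qquad i \in \text{just-updated points},\quad j \in \text{predicted points}
```

In the T12 triplet example, the cross-terms between T10 and T11 therefore shrink in proportion to the reduced T11 standard deviation, while the correlation coefficient itself is unchanged.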



5.3.5.4 How to calculate a consistent global prediction covariance

Although as noted above, we have used an approximate form of the prediction co-

variance, it is possible to calculate the correct form of this, given whatever set of

sub-models have been determined so far. Although we have not implemented this we

include it for completeness, and because other applications with less (or no) overlap

between sub-models may require it. First suppose that some subset of the overall

points are known, thus defining a (partial) shape S2. As more sub-models are fitted

the number of points in S2 increases, which in general will decrease the prediction

errors in the remainder of the shape S1. Note that S2 may comprise not only those

points which have already been determined by previous sub-model AAMs, but also

any special points set in the global initialisation process. Firstly for convenience

we suppose that these two subsets have the point indices re-ordered so that we can

partition the overall global covariance matrix C thus:

$$C = \begin{pmatrix} C^{(11)} & C^{(12)} \\ C^{(21)} & C^{(22)} \end{pmatrix} \qquad (5.9)$$

Then given the shape vector $x^{(2)}$ of $S_2$, and assuming a joint Gaussian distribution, the conditional distribution of $S_1$ is still Gaussian. Formulae for the conditional mean and covariance are given for example by de Bruijne [17]. The maximum likelihood estimator of $S_1$ is given by [17]:

$$\hat{x}^{(1)} = \bar{x}^{(1)} + C^{(12)} \left(C^{(22)}\right)^{-1} \left(x^{(2)} - \bar{x}^{(2)}\right) \qquad (5.10)$$

The covariance matrix $K_0$ of this conditional estimator is given by:

$$K_0 = C^{(11)} - C^{(12)} \left(C^{(22)}\right)^{-1} C^{(21)} \qquad (5.11)$$

There can be numerical problems in inverting C(22) due to chance covariance in the

training set, and multi-collinearity arising from estimating it from a small sample.

De Bruijne et al [17] suggest using ridge regression to avoid numerical problems by

adding a small constant along the diagonal of C(22).

In fact K0 represents the covariance given a perfect knowledge of S2, whereas in



reality we are assuming x(2) has errors. Because it is a linear prediction process we

can calculate the additional induced covariance from the regression coefficients of S1

on S2.

Defining for convenience

$$B = C^{(12)} \left(C^{(22)}\right)^{-1} \qquad (5.12)$$

the additional covariance induced in $S_1$, given the current covariance matrix $V^{(2)}$ for $S_2$, will be given by

$$K_1 = B\,V^{(2)} B^{T} \qquad (5.13)$$

This matrix must then be added to $K_0$ to obtain the final covariance matrix $K$ for the predicted points. Also because of undertraining there is a certain unmodelled error which we represent by an additional variance term $\sigma_m^2$. This can be estimated by bootstrap refitting of the shape model back to unseen examples using random selection to split the training/test set (e.g. ‘leave-8-out’). Then by averaging over all points in all such random test set selections, the residual variance $\sigma_m^2$ can be obtained, which is then added to the diagonal elements, hence:

$$K = K_0 + K_1 + \sigma_m^2 I \qquad (5.14)$$
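Since this calculation was not implemented in the thesis software, the following is only a sketch of how equations 5.11 to 5.14 could be evaluated. All function names are hypothetical, the optional `ridge` argument stands for the small diagonal constant suggested by de Bruijne et al [17], and the simple Gauss-Jordan inverse is an assumption suitable only for small, well-conditioned matrices.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]


def prediction_covariance(C11, C12, C22, V2, sigma_m_sq, ridge=0.0):
    """K = K0 + K1 + sigma_m^2 * I, following equations 5.11 to 5.14."""
    n2 = len(C22)
    # Optional ridge term on the diagonal of C22 before inversion
    C22r = [[C22[i][j] + (ridge if i == j else 0.0) for j in range(n2)]
            for i in range(n2)]
    B = matmul(C12, inverse(C22r))                 # equation 5.12
    BC21 = matmul(B, transpose(C12))               # C12 C22^-1 C21, using C21 = C12^T
    K1 = matmul(matmul(B, V2), transpose(B))       # equation 5.13
    n1 = len(C11)
    return [[C11[i][j] - BC21[i][j] + K1[i][j] + (sigma_m_sq if i == j else 0.0)
             for j in range(n1)] for i in range(n1)]
```

Setting `V2` to zero and `sigma_m_sq` to zero recovers $K_0$, the covariance under perfect knowledge of $S_2$.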

5.3.5.5 Inverting the covariance matrix

Because the constrained AAM actually works with the inverse of the points’ covari-

ance matrix, we work with each sub-model’s locally relevant sub-partition of the

overall covariance matrix. This sub-matrix is in effect inverted using SVD. As the covariance matrices are necessarily symmetric (and positive definite when of full rank) we could have used Cholesky decomposition [108], which would be more efficient. However there may be problems in obtaining a full rank covariance matrix if the training set is not sufficiently large, and so we use SVD which is more numerically robust in such situations. If dealing with

larger matrices (e.g. full updating of the prediction covariance matrix as discussed in

the previous section) then it might be more efficient to add in small constants along

the diagonal of the matrix, effectively then using ridge regression as recommended by

de Bruijne [17], and then use Cholesky decomposition. Note that we never use the

inverse of the covariance matrix itself, as this always appears in equation 4.31 right

multiplied by a vector (or another matrix which can be regarded as a column-wise



concatenation of vectors). Thus we actually numerically solve equations of the form

$$S_X\,d = y \qquad (5.15)$$

using a pre-calculated singular value decomposition of $S_X$. Similar considerations would apply to inverting $C^{(22)}$ in equations 5.10 and 5.11.

Another implementation detail is that because we work with several sub-models at

each iteration (and then only select the best-fitting one), some care needs to be taken

over the precise timing of when the singular value decompositions are recomputed. A

kind of cache is used, and whenever a sub-model is selected as the one to impose at

this iteration, any neighbouring models which have a mutual overlap are flagged as

requiring re-calculation of the singular value decomposition. Otherwise cached values

may be re-used.
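The caching scheme described above might look roughly like the following sketch. The class name, the `overlaps` mapping and the `decompose` callable are assumptions introduced for illustration; the real tool's data structures are not described in the text.

```python
class DecompositionCache:
    """Cache one decomposition per sub-model, recomputing only when flagged stale."""

    def __init__(self, overlaps, decompose):
        # overlaps:  dict mapping sub-model name -> set of neighbours sharing points
        # decompose: callable(name) -> decomposition (e.g. an SVD of that model's S_X)
        self.overlaps = overlaps
        self.decompose = decompose
        self.cache = {}
        self.stale = set(overlaps)  # nothing has been computed yet

    def get(self, name):
        """Return the cached decomposition, recomputing it if flagged stale."""
        if name in self.stale or name not in self.cache:
            self.cache[name] = self.decompose(name)
            self.stale.discard(name)
        return self.cache[name]

    def impose(self, name):
        """Call when `name` is selected as the sub-model to impose this iteration."""
        # Its updated points alter the covariance of each overlapping neighbour
        self.stale.update(self.overlaps[name])
```

A neighbour's decomposition is then recomputed lazily, the next time that neighbour is evaluated as a candidate.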

5.3.6 Quality of Fit Measure

The “best fitting” model is essentially taken to be the one with the lowest residual

sum of squares. However some rescaling is applied to try and have a more consistent

way of comparing different sub-models containing potentially different numbers of

points. Firstly the scaled residual sum of squares is calculated as:

$$S = \sum_{i=1}^{n} \frac{r_i^2}{\sigma_{r_i}^2} \qquad (5.16)$$

where $r_i$ is the grey level residual at point $i$, and $\sigma_{r_i}$ is its estimated standard deviation. Then we effectively map $S$ onto its associated cumulative distribution function (CDF), as when comparing different sub-models with different numbers of points it is not meaningful to directly compare values of $S$. So $S$ is converted to a value on a standard Gaussian approximating the associated theoretical $\chi^2$ distribution. In

the standard appearance model training code we use, the set {σri} is estimated by

refitting the appearance model back to its own training set, and then calculating the

residual variances. However this tends to under-estimate the true values when fitting

to unseen data, so we introduced an additional scaling parameter α on the expected

value of S to account for this. Also model inadequacies can cause nearby residuals to

be positively correlated, so the variance of the sampled distribution is boosted, which



we model by a second scaling parameter β. This can happen because for example

shape model inadequacies lead to adjacent portions of edges being offset from the

“true” edge, which leads to spatially consistent residuals, whose positive correlation

will boost the overall residual variance.

If the distribution were a true $\chi^2$ then we would have:

$$E(S) = n, \qquad \mathrm{var}(S) = 2n \qquad (5.17)$$

We extend this to

$$E(S) = \alpha n, \qquad \mathrm{var}(S) = 2\alpha^2 \beta n \qquad (5.18)$$

Estimates of these multipliers can be obtained by repeated random splitting into a

training/test set, and estimating the distribution of S from the trained appearance

model’s best fit to the annotated image (given the annotated shape). There is a

danger that outliers from images containing atypical pathologies will bias the fitting

process. Therefore a robust estimator of sample variance was used. We used the Sn

statistic of Rousseeuw and Croux [120], which estimates the standard deviation of a

distribution given a sample set X as:

$$\sigma = 1.1926\; M_{\{i\}}\!\left(\left\{ M_{\{j\}}\!\left(\{\,|x_i - x_j| : x_j \in X\,\}\right) : x_i \in X \right\}\right) \qquad (5.19)$$

where M{k}(S) represents the median of values in set S, whose elements are iden-

tified by a set of indices {k}. The median of median sample separations is used,

so no estimate of central location is needed. This estimator has significantly better

normal efficiency than the more common MAD estimators, and is more appropriate

for asymmetric distributions.
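Equation 5.19 translates almost directly into code. The sketch below follows the formula as printed, using the ordinary sample median throughout; as far as we are aware the published estimator of Rousseeuw and Croux uses slightly different median conventions and finite-sample correction factors, which are omitted here. The function name is hypothetical.

```python
from statistics import median


def sn_scale(xs):
    """Robust scale estimate following equation 5.19."""
    # For each x_i, the median absolute difference to all sample values
    inner = [median(abs(xi - xj) for xj in xs) for xi in xs]
    # The median of those medians, scaled for consistency with a Gaussian SD
    return 1.1926 * median(inner)
```

Note that this direct form costs O(n^2 log n) for n samples, which is unimportant for the modest sample sizes arising from repeated train/test splits.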

Having estimated sample standard deviation (and hence variance vS) by Rousseeuw

and Croux’s method, we then calculate a sample mean µS on a trimmed sample,

obtained by excluding any outliers more than three standard deviations beyond the

median of S. The robustly estimated sampled variance vS and trimmed mean µS of

the sampled values of S then determine α and β thus:

$$\alpha = \frac{\mu_S}{n}, \qquad \beta = \frac{n\,v_S}{2\mu_S^2} \qquad (5.20)$$
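A small sketch of this estimation step follows. The function names are hypothetical, and the trimming rule is one reading of the text: it is assumed here that "beyond the median" means more than three standard deviations above it, since outliers in a sum of squares are large values.

```python
from statistics import mean, median


def trimmed_mean(samples, sd, k=3.0):
    """Mean of the samples after dropping values more than k SDs above the median."""
    cutoff = median(samples) + k * sd
    return mean([s for s in samples if s <= cutoff])


def alpha_beta(mu_s, v_s, n):
    """Equation 5.20: choose alpha and beta so that
    E(S) = alpha * n and var(S) = 2 * alpha^2 * beta * n match the sample."""
    alpha = mu_s / n
    beta = n * v_s / (2.0 * mu_s ** 2)
    return alpha, beta
```

Substituting the returned values back into equation 5.18 reproduces the sample mean and variance, which is a convenient check on any implementation.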



The final standardised Gaussian z-value used is:

$$z = \frac{S - \alpha n}{\alpha\sqrt{2n\beta}} \qquad (5.21)$$

The quality measure is obtained by just negating this, as picking the highest value of

−z is equivalent to choosing the candidate with the lowest probability of obtaining a

residual sum of squares any better than that achieved.
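Equations 5.16 and 5.21 can be combined into a short quality function, sketched below with hypothetical names. It assumes $\alpha$ and $\beta$ have already been estimated as described above.

```python
import math


def scaled_rss(residuals, sds):
    """Equation 5.16: sum of squared grey-level residuals, each scaled by its SD."""
    return sum((r / s) ** 2 for r, s in zip(residuals, sds))


def fit_quality(S, n, alpha, beta):
    """Return -z from equation 5.21; larger values indicate a better fit."""
    z = (S - alpha * n) / (alpha * math.sqrt(2.0 * n * beta))
    return -z
```

Because the value is standardised by the number of residuals $n$, candidates from sub-models with different numbers of points can be ranked on the same scale, and the candidate with the highest quality is the one imposed.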

5.4 Conclusion

We have presented a general algorithm for combining multiple (possibly overlapping)

sub-models by using the constrained AAM. A global shape model is also still used, to

provide linkage via iteratively updated predictions. Global constraints are therefore

still present, but as a weak prior. We have also developed a method of allowing the

fitting order of the sub-models to be determined by the data, using the heuristic of

“best quality first”. We believe that this will in general improve the robustness of the

search by deferring noisy or poorly fitting regions until they have been constrained

by their neighbours. Also this permits generalisation of the algorithm to cases where

there is no natural fitting order present. In the next chapter we will present results

showing the application of these methods to DXA images of the spine, and will

demonstrate that the dynamic ordering method improves the accuracy compared to

a static ordering of proceeding up the spine starting with the lower lumbar vertebrae.

The dynamic ordering algorithm together with a principled update of the global

constraints leads to a natural generalisation of the approach. This applies even when

the sub-structures have no overlapping portions, and no natural ordering for the fit

sequence, since the data itself suggests the sequence.

In principle the algorithm could be applied hierarchically with more than one layer

of decomposition. For example a vertebral triplet could be decomposed into the

upper and lower halves of single vertebrae. Such layered decomposition would be

intermediate between standard AAMs and the single point local AAMs of [30]. The

generality means this approach offers an interesting way of handling under-training

of pathological cases - a perennial problem in applying model based vision in medical

contexts.


Chapter 6

Vertebral Segmentation using Multiple AAMs

6.1 Introduction

In this chapter we apply the multi-AAM method of the previous chapter to the

problem of segmenting vertebrae. We have already published similar results in a

number of papers [113, 114, 116, 117]. The training set has been somewhat extended

since these publications, the AAM parameters have been further optimised, and we

also present previously unpublished data on the optimal choice of sub-model structure

and appearance model form.

6.2 ASM vs AAM

Smyth et al [126] previously used an ASM to segment vertebrae, employing a single

model of the spine from L4 to T7. However we have used the AAM instead. The

AAM is a more statistically principled approach, as it incorporates all correlations

between texture at different points, and also models the correlation between shape

and texture via a tertiary PCA. Also as the AAM can match to a range of actual

textures, it is better suited than the ASM to the problem of segmenting vertebrae,

where the associated image texture can be quite variable, and in fact correlates with

shape. For example fractured vertebrae have not only a different shape, but the


appearance of the endplate is different to that of a normal vertebra: the fractured

endplate has a more diffuse, often multi-edged, texture [79] (see Figures 6.4 to

6.7); whereas a normal vertebra would have a clearly defined margin. An ASM in

contrast would try and adjust its shape so that each point located a position that

best matched the mean texture, whereas the AAM adjusts its appearance parameters

to best match to the actual texture of the image. Therefore this study used AAMs

rather than ASMs. In fact in further work by Smyth [125], in order to use an ASM

with fractured vertebrae it was necessary to introduce additional edges into the shape

model. However these essentially became degenerate in the case of normal vertebrae.

But with an AAM the surrounding texture (e.g. multi-edged structure) can more

naturally be incorporated into modes of the texture model.

Another group working with lumbar radiographs originally used the ASM as part of

a hierarchical segmentation process [143], but more recently have changed to using

the AAM instead [77]. On the other hand, as discussed in the previous chapter, the

ASM can have a larger convergence zone than the classical AAM, due to sampling

outside of the current shape along profiles. We attempt to have the best of both

worlds by using a profile AAM.

Furthermore the AAM can also combine prior estimates of shape points with the

image grey level residuals to perform Constrained AAM search [31]. In the previous

chapter we showed how this can be used to link multiple AAM sub-models together.

In this chapter we use these multi-AAM methods to segment vertebrae.

6.3 Data - DXA Images

6.3.1 Summary of Training Set

Assessment of vertebral fractures by spinal radiographs is still the definitive method

of determining vertebral fracture. However DXA images are being increasingly used,

because although the images are noisier and of lower spatial resolution, they have

several advantages, as discussed in chapter 3. DXA involves a substantially lower

radiation dose than conventional spinal radiographs, avoids the projectional parallax

effects of conventional radiographs due to the divergent X-ray beam, and has the

whole spine on a single image. See chapter 3 for further details. Typical images are



illustrated in Figures 6.1 and 6.2. These figures also illustrate the annotated shape

model points, and include some more unusual shapes and fractured vertebrae.

Because the thoracic vertebrae are more clearly identified on dual energy images,

dual, rather than single, energy images have been used in the study. The anonymised

DXA images used were obtained from two previous studies [126, 94] with institutional

ethical approval, together with anonymised images from recent patients aged 65 years

and older referred for bone densitometry, and for whom the referring physician had

requested DXA vertebral fracture assessment. We obtained the dataset of Smyth et al

[126] consisting of 78 lateral spine dual energy DXA images in women (mean age 61,

age range 44-80 years), acquired on a Hologic (Bedford, MA) QDR2000plus scanner.

The dimensions of each pixel of the scan were 0.9 x 1.0 mm. We also annotated further

images in the same dataset, which had not been used before, as the previous study

focussed on normal, and not fractured, vertebrae. We added a further 46 images from

this source, 31 of which included fractures or borderline deformities. We obtained a

further set of anonymised DXA images that had been used in a previous study into

the efficacy of clodronate [94], using a Hologic QDR4500A scanner. The dimensions

of each pixel of the scan were 0.5 x 0.5 mm. This gave a further set of 78 images

for inclusion. Of these, 59 included vertebral fractures or borderline deformities. We

selected a further 158 images from patients attending for DXA BMD measurement,

using a Hologic QDR 4500 Discovery, with a resolution of 0.35 mm. These patients

had a propensity to osteoporosis, and the specific images were selected because they

contained fractures or other deformities helpful in training the shape model. Patients

in the clodronate study tended towards obesity. DXA images of obese patients give

a particularly poor signal to noise ratio, especially in the lumbar spine. There were

10 images of extremely poor quality in the clodronate study which were not used

for testing, but still partially used for training models, as these images contained a

high fracture prevalence. This left a test set of 350 images, with 512 fractures or

deformities.

6.3.2 Shape annotation

The images were annotated manually using an in-house tool. This was written in the ANSI C++ programming language using a combination of: i) the Trolltech Qt library (Trolltech ASA, Oslo, Norway) to provide the user interface; ii) proprietary in-house statistical modelling software. The tool partly used previous Active Appearance Models to reduce the manual labour, with the models being iteratively updated as the annotated cases were added to the dataset. A useful feature of this tool is that when points have been manually repositioned, they constrain other points in the model; so if the initial automatic fit is incorrect, the user can partially correct it and then re-run the model fit as constrained by the corrections (the Constrained AAM is used with high constraint weights on the manually positioned points).

Figure 6.1: The left hand image is a good quality DXA image, with the manually annotated shape superimposed next. Rightmost is a typical DXA image with moderate diaphragm motion artefacts (T10, T11), with the standard morphometric 6 points indicated.

Figure 6.2: The left hand image is a DXA image showing fractures of T12 and T5 (possibly T7). Next the same image is shown with the full annotated shape, followed by an image with obvious fractures of T10 and T12. Rightmost is an image with unusual lumbar curvature.

There can be vertebrae to which the provisional shape model cannot be adequately

fitted, because of undertraining. Therefore we also introduced a dynamic programming edge search method into the tool. This first fits the vertebra’s shape model∗ learned so far to points already positioned by the user. Typically this will not fit precisely due to undertraining, so next the whole shape is warped using thin plate splines so that the shape exactly passes through all points fixed by the user. Next a profile

search 5 mm either side of unconstrained points, and 1mm either side of constrained

ones is conducted to try and obtain a maximum sum of local (along profile) absolute

gradients - i.e. we look for the strongest local edge at each point. However high curvature is penalised, and the combined objective of gradient strength and curvature penalty can be straightforwardly optimised by dynamic programming. Hence the vertebra’s shape is updated. If necessary the user can then fix a further subset of points. The edge seek can have a tendency to overfit - it is difficult to get a curvature penalty coefficient that works ideally on every image. So the final refinement is typically for the user

to fix certain points which are regarded as well positioned, and then use a “Snap to

Model” feature which just implements the first part of this algorithm (fit the current

shape model, then use thin plate spline warps to exactly fit all manually positioned

points). As a rough guide it was found to generally be adequate to manually fix

around one point in 4, plus all points going around vertebral corners. Hence the tool

is not constrained by the prototype models in cases in which there is an inadequate

fit. The initial annotation was performed by the author (MR), with checking by an

experienced radiologist (JEA), particularly in difficult cases (e.g. fractures). The

later 158 image clinical dataset was annotated by an experienced radiographer (SC),

supervised by the author.
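The dynamic programming edge search described in this section can be illustrated with the following sketch. It is a simplification under stated assumptions: the exact curvature penalty used in the tool is not given in the text, so a squared change in profile offset between neighbouring contour points is used here as a stand-in smoothness term, every point is given the same number of candidate offsets, and all names are hypothetical.

```python
def dp_edge_search(grad, penalty):
    """Choose one profile offset per contour point by dynamic programming.

    grad[i][o] -- absolute gradient strength at candidate offset index o of point i
    penalty    -- weight on the squared change in offset between neighbouring points
    Returns the list of chosen offset indices, one per point.
    """
    n_offsets = len(grad[0])
    score = list(grad[0])   # best total score ending at each offset of point 0
    back = []               # back[i-1][o] = best predecessor offset for point i
    for row in grad[1:]:
        new_score, pointers = [], []
        for o in range(n_offsets):
            # Best predecessor: accumulated edge strength minus smoothness cost
            prev = max(range(n_offsets),
                       key=lambda p: score[p] - penalty * (o - p) ** 2)
            new_score.append(score[prev] - penalty * (o - prev) ** 2 + row[o])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final offset
    o = max(range(n_offsets), key=lambda p: score[p])
    path = [o]
    for pointers in reversed(back):
        o = pointers[o]
        path.append(o)
    return path[::-1]
```

With a zero penalty each point simply snaps to its strongest local edge, which corresponds to the overfitting tendency noted above; increasing the penalty trades edge strength for a smoother contour.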

∗ Strictly the shape model of the sub-model determining the vertebra.

We also modelled the linkage of the vertebral bodies with the pedicles, as the inferior margin of the pedicle is continuous with the posterior inferior margin of the vertebral body. However, pedicles above T10 were not so modelled as the angulation of these pedicles results in their being poorly visualised on a lateral scan. Although the pedicles are not of diagnostic interest, including the anchoring edges may help to stabilise the AAM search by providing further structure in the model (Figure 6.3). Since the posterior part of a vertebra is less prone to osteoporotic fracture, it was thought that this anchoring of the posterior portion of the vertebrae would help to stabilise the search, particularly with severe endplate or wedge fractures. Provisional studies using only the dataset of Smyth et al [126] indicated a marginal improvement in accuracy could be obtained by including the pedicles. Each vertebral contour uses 40 points around the vertebral body with 8 further points around the pedicles for L4-T10 (see Figure 6.3), and 32 points per vertebra for T9-T7.

Figure 6.3: The left hand image shows all model points around a lumbar vertebra, including pedicle connections. The middle shows the annotation of a T10 fracture, and on the right that of a severely fractured T9.

It can sometimes be rather ambiguous where to annotate fractured vertebrae. Some-

times there appear to be two or three edges: the remaining cortical rim, and a diffuse

lateral picture of the collapsed endplate. In such cases we attempted to consistently

place the vertebral contours on the outer edge of the endplate (but not the cortical

rim). As discussed above the variation in appearance should in principle be captured

as different texture modes in the appearance model.

6.3.3 Point correspondence

It would be possible to adjust the manually annotated shapes using the MDL ap-

proach [40] to further optimise the correspondences between the annotated points

across the training set images. We have simply retained the manually annotated

shapes for several reasons. Firstly, although there are some difficult cases, the box-like structure of vertebrae generally means that there are some obvious landmarks at the vertebral corners, connection with the pedicle, and typically also the endplate mid-points as used in standard 6-point morphometry. We are not dealing with amorphous 3D blobs where establishing a set of point correspondences is a major problem. Secondly because we have always started with an existing shape model - originally from the dataset of Smyth et al [126] - our annotation tool uses an existing model to initialise points which have not been explicitly positioned, together with a small degree of thin plate spline warping to force the model fit to the manually positioned points. As discussed above this has been used in conjunction with local edge fitting. So we have never been in the position of simply equi-positioning points along a curve, as an existing shape model (gradually refined) has been used throughout. The model was regularly rebuilt as the annotation proceeded, so that some degree of groupwise correspondence has been implicitly induced.

Figure 6.4: Illustration of shape annotation of a fractured lumbar endplate. Although a remnant of cortical rim is still visible, the shape points are annotated on the more inner endplate border.

Figure 6.5: Illustration of shape annotation of a fractured lumbar endplate. Although a remnant of cortical rim is still visible, the shape points are annotated on the more inner endplate border.

Figure 6.6: Illustration of shape annotation of a fractured T12 endplate. Although a remnant of cortical rim is still visible, the shape points are annotated on the more inner endplate border.

Figure 6.7: Illustration of shape annotation of several fractured vertebrae. Although remnants of the vertebrae’s cortical rims are still visible, the shape points are annotated on the more inner endplate borders. Note the generally fuzzier edge structure of fractured vertebrae.

Thirdly the MDL method adjusts point locations around an existing segmentation,

which is typically obtained by a manual process. Given the noisy nature of DXA data,

and the often multi-edged appearance of fractured vertebrae; together with other

complicating pathologies such as osteophytes on the anterior corners; we consider

that correspondence errors in the original segmentation are likely to exceed any errors

in point placement around the rim of the segmented shape. So the gain from MDL



may be lost in the other “noise” inherent in the original segmentation. Fourthly

in order to maximise the opportunity to train the models on as many fractured

vertebrae as possible, we have evaluated model fitting performance using a “leave-

N-out” train/test loop. In order to avoid bias, we would need to run the MDL

optimisation algorithm for each train/test split. This would be a costly computational

overhead, made worse by the fact that the original MDL optimisation algorithm was

designed for single part shapes; using multi-part shapes (or shapes not homotopic to

a circle/sphere) introduces an additional layer of combinatorial complexity. Since it

already takes a substantial amount of computation time (several hours) to build all

the Active Appearance Models†, we believe that the computational overhead would

not justify marginal improvements in the shape models.

Finally we are not really considering shape as the only criterion, as we are interested

in how well appearance models can be automatically fitted to unseen images. We

would really need to consider an MDL approach to the whole appearance model.

The development of this is still continuing. Although there has been recent progress

[29] in building fully automatic appearance models, we do not believe that this tech-

nology is sufficiently mature to be reliably used on difficult and noisy data like DXA

images. Also in [29] the correlation between shape and texture was not modelled, but

that is important in the form of (profile) appearance models that we use. Moreover

the pathological cases require some insight into the physical structures, whilst the

repetitive edge structure of vertebrae means that it would be easy for a completely

automatic process to fall into local minima involving fitting onto the converse edge

of a neighbouring vertebra - for example the superior edge of T12 might be assigned

to the inferior edge of T11, especially when there are diaphragm motion artefacts

present, or disc disease may obscure the inter-vertebral space.

We have examined our shape model modes, and not observed any unphysical char-

acteristics within allowed parameter ranges, such as might be caused by poor point

correspondence. Later in this chapter we also present results in shape model general-

isability, which appears to be adequate except for a small number of severe fractures.

Therefore we do not believe the use of the MDL method to be justified in this case.

†we use a 20 node cluster as it is



6.4 Initialisation

It is necessary to define how the starting solution for model searches is to be con-

structed. When the algorithm is run interactively in an associated prototype clinical

tool, the clinician initialises the solution by clicking on a number of points with the

mouse. We experimented with two methods of initialisation. In the first method

(adopted from Smyth [126]) the clinician clicks on the bottom, middle and top of

the spine. More specifically these points were the middle inferior of L4, middle supe-

rior of T12 and middle superior of T7. The least squares global shape model fit to

these 3 points is used as the starting solution.

It was noted that the clinician would essentially move the mouse pointer up the spine

counting vertebrae during the initialisation, and so it would take only a little more time

to simply click instead in the approximate centre of each vertebra in turn. This

might provide a better initialisation, especially in cases where high kyphosis (often

accompanied by fractures) or scoliosis produce departures from the typical 3-point

prediction.

When using the 10-point initialisation (centre of each of 10 vertebrae) there are

some additional technical details. Firstly there is a slight complication in using the

underlying C++ shape model code, because the points used for initialisation are not

strictly speaking contained in the shape model itself. But essentially they are linear

combinations of other points, with an inherent additional variance (pointer error).

The centre $x_v^{(c)}$ of vertebra $v$ can be defined as

$$x_v^{(c)} = 0.5\,(x_{vs} + x_{vi}) + e \qquad (6.1)$$

where xvs is the superior midpoint of vertebra v and xvi is its inferior midpoint, and e

is a random error, which we assume has zero mean and SD of 2mm in the y direction

(double that of an edge point) and the same 3mm in the x-direction. Then the

covariance matrix of the extended vector of the original points plus these manually

placed centre-points can be deduced from this linear combination and the underlying

covariance matrix of the shape model. This extended covariance matrix can then in

turn be diagonalised to produce an extended shape model for initialisation. It is this

extended shape model which is used for the initialisation as follows.



1. Firstly this extended global shape model is used to obtain the model’s best fit

to the input set of vertebral centres. However, conditions such as scoliosis or

high kyphosis can cause the global model shape fit to be poor at particular

vertebrae.

2. So next the individual vertebrae are rigidly translated back to be centred on

their input positions, and then each sub-model is fitted to these translated

points in turn, by moving up the spine from L4. This gives a closer fit of

the centres to the user’s input, but can occasionally violate consistency. In

particular a vertebra could overlap with a neighbouring one.

3. There is a final correction phase, where if any overlaps are detected, the over-

lapping vertebrae are both reduced in height in proportion to the local extent

of their overlap, until they are separated. Points are moved along the local

normals (estimated from a Bezier spline) until they are separated by at least

1mm for thoracic vertebrae, and 2mm for lumbar vertebrae.

Two iterations around steps 2 and 3 are performed.
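The extended covariance matrix used to build the initialisation model (described before the numbered steps) can be written out explicitly. The following is a sketch under stated assumptions rather than the thesis's own derivation: the symbols $A$, $\Sigma_x$, $\Sigma_e$, $\mathbf{z}$, $P_z$ and $\Lambda_z$ are introduced here for illustration, and the pointer error is assumed to be independent of the shape.

```latex
% Assumptions (not from the source): \mathbf{x} stacks the original shape-model
% points and has covariance \Sigma_x; \mathbf{c} stacks the clicked centres;
% each row pair of A holds 0.5 at the superior and inferior mid-point
% coordinates of one vertebra (equation 6.1); \mathbf{e} is independent of
% \mathbf{x}; \Sigma_e is block diagonal with blocks diag(3^2, 2^2) mm^2.
\begin{align}
\mathbf{c} &= A\,\mathbf{x} + \mathbf{e}, \qquad
\mathbf{z} = \begin{pmatrix} \mathbf{x} \\ \mathbf{c} \end{pmatrix}, \\
\Sigma_z &= \begin{pmatrix}
  \Sigma_x    & \Sigma_x A^{T} \\
  A\,\Sigma_x & A\,\Sigma_x A^{T} + \Sigma_e
\end{pmatrix}, \\
\Sigma_z &= P_z\,\Lambda_z\,P_z^{T}.
\end{align}
```

The last line is the diagonalisation step: the leading eigenvectors in $P_z$ would play the role of the modes of the extended shape model, which can then be fitted to the clicked centres alone.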

6.5 Experiments

6.5.1 Summary of optimal AAM determination

The overall optimisation problem is to select the best combination of sub-model

partitioning, fit sequencing and initialisation, and AAM form, together with various

associated parameters (e.g. profile length). It is not computationally tractable to

determine the overall best combination. So we have split the problem up into a

number of stages, at each stage determining the best approach and carrying it forward

to the next.

Since we already knew from our earlier work [113] that the sub-model triplets perform

well, we performed the initial experiments using vertebral triplets as the sub-model

structure, with the 3-point initialisation method. In this first stage we determined

the appearance model form to use, also addressing the question of whether a dynamic

ordering of the fitting (see section 5.3.3) performed better than a fixed static ordering

(from L3 to T8). For the dynamic method we used Nc = 3, with the initial candidate



list containing the sub-models corresponding to the 3 initialisation points, i.e. the

triplets centred on T8, T12 and L3.

Secondly we compared the two initialisation methods and picked the better for sub-

sequent use. When using 10-point initialisation we increased Nc to 5, with the initial

candidate list containing alternate sub-models starting with the top-most (i.e. triplets

centred on T8, T10, T12, L2 and then L3 as before). This was done because, although

it made sense with 3-point initialisation to start with a candidate set corresponding

to the more restricted set of initialisation points, with 10-point initialisation we have

as good initial information on every sub-model, and are constrained only by the

algorithm run-time. We subsequently refined certain appearance model parameters.

Thirdly we compared different methods of applying point constraints: a simplistic

diagonal matrix, or a full covariance matrix derived from bootstrapped estimates of

prediction errors.

Fourthly we compared different sub-model structures, also comparing to a single

global AAM. We were able to determine an optimal ‡ sub-model structure.

Finally we investigated the improvement that can be obtained with fractured ver-

tebrae by certain modifications to the sub-model sequencing algorithm (see section

5.3.3), using multiple initialisations of the sub-model shape.

In each set of experiments “miss-8-out” tests were performed over the dataset. On the

earlier experiments the 3-point user initialisation was emulated by using the known

equivalent marked points and adding random offsets to them. These were zero-mean

Gaussian errors with SD of 1mm in the y-direction (along the spine) and 3mm in the

x-direction. In later experiments using a 10 point initialisation on vertebral centres,

the centre point was estimated from the manually annotated data as the mean of

the two (superior and inferior) mid-points; and then similar Gaussian errors were

added, but with the SD in the y-direction increased to 2mm, as the centre is harder

to estimate than the location of an edge. Ten replications (i.e. random initialisations)

of each image were performed.
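The emulated user initialisation described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the seeding scheme and the tuple representation of points are assumptions; only the standard deviations and the number of replications come from the text.

```python
import random

# Standard deviations (mm) quoted in the text; y runs along the spine.
SD_X_MM = 3.0
SD_Y_EDGE_MM = 1.0    # 3-point initialisation (edge mid-points)
SD_Y_CENTRE_MM = 2.0  # 10-point initialisation (vertebral centres)


def vertebra_centre(superior_mid, inferior_mid):
    """Mean of the annotated superior and inferior mid-points of one vertebra."""
    return (0.5 * (superior_mid[0] + inferior_mid[0]),
            0.5 * (superior_mid[1] + inferior_mid[1]))


def emulate_clicks(points, sd_y, rng, sd_x=SD_X_MM):
    """Add independent zero-mean Gaussian offsets to each (x, y) point."""
    return [(x + rng.gauss(0.0, sd_x), y + rng.gauss(0.0, sd_y))
            for x, y in points]


def replicate_initialisations(points, sd_y, n_reps=10, seed=0):
    """Generate n_reps independently perturbed copies of the click points."""
    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    return [emulate_clicks(points, sd_y, rng) for _ in range(n_reps)]
```

For the 10-point method, `points` would hold the ten centres computed with `vertebra_centre`, and `sd_y` would be `SD_Y_CENTRE_MM`.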

The accuracy of the search was characterised by calculating the absolute point-to-line

distance error for each point on the vertebral body. This is computed by fitting a

smoothed Bezier spline to the annotated shape (which is treated as a gold standard),

‡strictly speaking we determined a close-to-optimal structure



and then for each segmented point we calculate the closest distance to this spline

(within the same vertebra). We computed the mean, median, and 95th percentiles

of the pooled point error data. The number of points which failed to be fitted ad-

equately was assessed using a cut-off threshold of 2mm on the point-to-line error,

which would be 2.5 standard deviations of manual precision (c. 0.8mm) for patients

with osteoporosis [57]. As well as accuracy we also evaluated the precision, which

measures the self-consistency of differently initialised segmentations, rather than how

well they conform to a gold standard. Precision figures were also calculated as a mean

absolute point-to-curve error, but taking the reference curve to be the mean of the 10

segmentations from the 10 different initialisations. Also one degree of freedom must

be deducted so we divide the total error by 9 in this case to obtain the precision.
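The pooled accuracy statistics and the precision figure described above might be computed along these lines. This is a sketch under assumptions: the point-to-line distances are taken as already computed, the percentile uses linear interpolation between order statistics (the text does not say which definition was used), and the function names are illustrative.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarise_errors(errors_mm, threshold_mm=2.0):
    """Pooled accuracy statistics for absolute point-to-line errors (mm)."""
    n_bad = sum(1 for e in errors_mm if e > threshold_mm)
    return {
        "mean": statistics.fmean(errors_mm),
        "median": statistics.median(errors_mm),
        "p95": percentile(errors_mm, 95),
        "pct_over_threshold": 100.0 * n_bad / len(errors_mm),
    }


def precision(abs_errors_to_mean_curve):
    """Total absolute error to the mean curve divided by (n - 1).

    With 10 replications this divides by 9, reflecting the one degree of
    freedom used up by estimating the mean curve from the same data.
    """
    n = len(abs_errors_to_mean_curve)
    return sum(abs_errors_to_mean_curve) / (n - 1)
```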

A potential failure mode, particularly when fitting to fractured vertebrae, is for the

AAM to fit the top edge of a vertebra to the bottom of the vertebra above, or vice

versa (see e.g. Figure 6.8). This may happen when the initial solution is much

closer to the neighbour, because the fracture has collapsed the correct edge far from

the initial solution. We evaluated the number of cases in which this happened as

follows: if a point is more than 0.5 mm closer to the neighbour’s contour than to the

correct contour, and is within 2mm of the former, then this is flagged as a potential

misalignment. If at least 5 points along either the top or bottom 12 contour points

are thus misaligned, then the respective edge is counted as a misaligned edge. Note

that a vertebra can have two misaligned edges (i.e. the top has been fitted to the

vertebra above, and the bottom has been fitted to the vertebra below).
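The misalignment rule above translates directly into a small test. The sketch below assumes the two distances per point (to the correct contour and to the neighbour's contour) have already been measured; the function and parameter names are illustrative, and "within 2mm" is interpreted as less than or equal to 2mm.

```python
def is_point_misaligned(d_correct_mm, d_neighbour_mm,
                        margin_mm=0.5, max_neighbour_mm=2.0):
    """Flag a point lying closer to the neighbour's contour than to its own.

    The point must be more than margin_mm closer to the neighbour and also
    within max_neighbour_mm of the neighbour's contour.
    """
    closer_to_neighbour = (d_correct_mm - d_neighbour_mm) > margin_mm
    return closer_to_neighbour and d_neighbour_mm <= max_neighbour_mm


def is_edge_misaligned(d_correct, d_neighbour, min_points=5):
    """Decide whether one edge (its 12 contour points) counts as misaligned."""
    flagged = sum(
        is_point_misaligned(dc, dn) for dc, dn in zip(d_correct, d_neighbour)
    )
    return flagged >= min_points
```

The function would be called once for the superior edge and once for the inferior edge, so a single vertebra can contribute up to two misaligned edges.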

6.5.2 AAM form used

6.5.2.1 Summary of AAM types

We experimented with several forms of AAM. The first distinction to be made is

between “classical” triangulated regional AAMs, which sample within the convex

hull of the shape; and profile AAMs, which (like the earlier ASM) sample along

normals to the current shape, defined at each shape model point. The profile AAMs

may have a larger convergence zone, as the profiles extend over some region outside

of the current shape, just as it has been found [25] that the ASM can have a larger

convergence zone than a regional AAM. On the other hand the regional AAM has

the advantage of consistently defining a shape-free patch, by warping the sampling



Figure 6.8: A misaligned AAM fit to T8. The superior edge has been fitted to the inferior edge of T7 (black arrows), whilst the anterior part of the inferior edge has been fitted onto the superior edge of T9 (white arrows). Only the 6 standard morphometric points are shown to avoid obscuring the vertebrae. Note that T7 has also been displaced, and the anterior points of T9 are affected.



region onto the mean shape via a set of triangulations (each of which can use an

affine transform). There would appear to be little useful information in the interior

of a vertebra, as the texture tends not to change much, or not consistently, within

a vertebral body. This is not like for example a human face, where possibly subtle

shading around the nose contains clues about where other parts of the face might

be. Given also the earlier success of the ASM-based work of Smyth et al [126], we

anticipated that profile AAMs would probably perform more accurately than regional

AAMs. We have experimented with the following types of AAM:

1. Profile AAMs sampling raw grey level

2. Profile AAMs sampling renormalised grey level

3. Profile AAMs sampling renormalised gradient (along profile)

4. Triangulated regional AAMs sampling raw grey level

5. Triangulated Regional AAMs sampling renormalised grey level

6. Triangulated Regional AAMs sampling renormalised 2D gradient

7. Triangulated Regional AAMs sampling corner/edge measures [123] derived

from the structure tensor $\nabla I\,\nabla I^{T}$ §.

In these initial experiments a nominal profile length of 6mm (either side of the shape)

was used (at the highest resolution). A similar sampling scale was used with regional

AAMs, resulting in the use of 4000 pixels per vertebral triplet (at the highest resolu-

tion).

AAM search is usually performed using a pyramid of successively smoothed images,

with a commensurate reduction in resolution. We used a 2-level pyramid, starting

at half the full resolution, because the image features were not large enough to be

preserved beyond half the original resolution. Note that for profile AAMs this means

that the search starts off using a real profile distance double that of the original (i.e.

12mm), but sampling a smoothed (then sub-sampled) image at twice the final step-

size. This is also the same degree of image pyramid as that previously used by Smyth

et al [126].

§with the Cartesian image gradient ∇I as a column vector



6.5.2.2 Texture normalisation

There can be problems in using raw grey levels, where the brightness or contrast vary

considerably across the image, or where a few particularly bright outliers may wreak

havoc with the AAM’s underlying least squares error norm. Bosch et al [15] used

a renormalisation to transform the asymmetric pixel-intensity distribution of ultra-

sound cardiac images to a Gaussian. This gave significantly improved AAM matching

results. We noted in chapter 4 that Cootes had used a sigmoidal renormalisation of

image gradient [32], see equation 4.27. However the functional form used by Cootes

and also later by Scott et al [123] is not appropriate for simple grey levels, as the

background has a non-zero expectation value, and the mean is unrelated to the con-

trast scale. Therefore for renormalising grey levels we have instead used a mapping

based on the signed square root of the Geman-McClure robust error function [61],

which like a sigmoidal form applies a standard normalisation to (-1,1) whilst tending

to truncate any extreme values. This maps a sample value gi of pixel i to g′i given

by:

$$g'_i = \frac{g_i - \mu}{\sqrt{(g_i - \mu)^2 + (\beta\sigma)^2}} \qquad (6.2)$$

where µ is the sample median and the scaling factor σ is a robust estimate of the

standard deviation of the background derived from the median absolute deviation

(MAD) about the median as σ = 1.4826MAD [76]. The multiplicative factor of

1.4826 arises because it can be shown, that assuming a Gaussian distribution, the

MAD is related to the standard deviation as:

$$\frac{\mathrm{MAD}}{\sigma} = \Phi^{-1}\!\left(\tfrac{3}{4}\right) \qquad (6.3)$$

where $\Phi$ is the cumulative distribution function for the standardised Gaussian distribution.

The multiplier $\beta$ is used because the Geman-McClure influence decreases after $\sigma/\sqrt{3}$,

and so it is common to use $\beta = \sqrt{3}$, to allow increasing influence out to one standard

deviation. We used this $\beta = \sqrt{3}$ setting, which normalises the absolute response to be

0.5 at one standard deviation.
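As a concrete illustration of the grey-level renormalisation of equation 6.2, the following sketch computes the median, the MAD-based scale and the mapped values. It is an illustrative reading of the text, not the thesis's implementation: the function names are hypothetical, and raising an error for a constant sample is a choice made here.

```python
import math
import statistics

# 1 / Phi^{-1}(3/4): converts a MAD into a Gaussian-consistent standard
# deviation (approximately 1.4826, the factor quoted in the text).
MAD_TO_SD = 1.0 / statistics.NormalDist().inv_cdf(0.75)


def geman_mcclure_map(g, mu, sigma, beta=math.sqrt(3.0)):
    """Equation 6.2: map a grey level g into the open interval (-1, 1)."""
    d = g - mu
    return d / math.sqrt(d * d + (beta * sigma) ** 2)


def renormalise(samples, beta=math.sqrt(3.0)):
    """Renormalise a texture sample using its median and MAD-based scale."""
    mu = statistics.median(samples)
    mad = statistics.median([abs(g - mu) for g in samples])
    sigma = MAD_TO_SD * mad
    if sigma == 0.0:
        # Assumption: a constant sample has no usable scale, so fail loudly.
        raise ValueError("MAD is zero, scale undefined")
    return [geman_mcclure_map(g, mu, sigma, beta) for g in samples]
```

With the default `beta`, a value one robust standard deviation above the median maps to 0.5, and a bright outlier is compressed towards, but never reaches, 1.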

For the gradient models we used the sigmoidal form of equation 4.27, except that



for one-dimensional gradients along a profile the magnitude of the gradient vector is

replaced by a simple absolute value.

6.5.2.3 Point constraint form

In all initial experiments we assumed a diagonal form of the point constraint matrix

SX in equation 4.31. The constraint weights used were the same as were used in

our original work [113], and are the inverses of the variances associated with the

following error standard deviations (mm) used for the core, overlap and predicted

points respectively (see also section 5.3.5).

σa = 1.0 (6.4)

σO = 2.5 (6.5)

σI = 5.0 on the first iteration, then (6.6)

σI = 3.5 on subsequent iterations (6.7)

Note that the prediction standard deviation is reduced from σI = 5.0 to σI = 3.5

because after the first iteration the prediction will always be based on at least one

fully fitted neighbouring vertebra, not just the initial user click-points.
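The diagonal constraint weights follow from these standard deviations as inverse variances. The sketch below is illustrative only: the point-type labels and function names are assumptions, and it treats the first iteration as iteration 0.

```python
def constraint_sd_mm(point_type, iteration):
    """Assumed error SD (mm) for a constrained point (equations 6.4 to 6.7)."""
    if point_type == "core":
        return 1.0
    if point_type == "overlap":
        return 2.5
    if point_type == "predicted":
        # Predictions rest on at least one fitted neighbour after iteration 0.
        return 5.0 if iteration == 0 else 3.5
    raise ValueError(f"unknown point type: {point_type}")


def diagonal_weights(point_types, iteration):
    """One inverse-variance weight (1 / sd^2, in mm^-2) per constrained point."""
    return [1.0 / constraint_sd_mm(t, iteration) ** 2 for t in point_types]
```

For example, `diagonal_weights(["core", "overlap", "predicted"], 0)` gives weights of 1.0, 0.16 and 0.04, so a core point is held 25 times more tightly than a first-iteration prediction.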

6.5.2.4 AAM-form results and discussion

The accuracy results for the point-to-line error measure with the various AAM forms

are given in tables 6.1 to 6.7. Each row gives the mean, median and 95th percentile,

the percentage of errors in excess of 2mm, and the percentage of (superior or inferior)

edges erroneously fitted onto the converse edge of a neighbour. The tables give results

separated into normal and fractured vertebrae, with the latter further subdivided by

fracture grade.

It can be seen that using a dynamic ordering of sub-models generally improves the

accuracy. Profile AAMs outperform classical triangulated AAMs overall even when

using the more sophisticated “feature AAM” of Scott et al [123]. It appears that by

focussing the texture model on the most informative regions, and sampling further

outside the current shape, the AAM performance is improved. No doubt this also re-

flects the lack of interesting structure in the middle of vertebrae. However, comparing



the triangulated corner-edge AAM with a profile gradient AAM, although the latter

generally outperforms the former, the triangulated corner-edge AAM does give better

results with grade 3 fractures. This may be because the corner-edge AAM is somehow

more robust against local minima on the edges of neighbouring vertebrae. Nevertheless

we adopt a renormalised gradient profile AAM for subsequent optimisation, and

subsequently substantially improve its performance against severe fractures.

Referring to the results for the various regional AAMs, simply renormalising the

texture works almost as well as the sigmoidal gradient AAM. Both forms of renor-

malised feature (gradient and corner-edge) AAMs give similar results. The gradient

AAM may be marginally more accurate with normal vertebrae, but the corner-edge

AAM outperforms it against grade 3 fractures, probably because of a longer conver-

gence region due to the larger spread of the feature induced by the regional smoothing

of the structure tensor implicit in the corner-edge measures. It is not surprising that

the corner-edge AAM performs slightly worse in most “normal” cases, as using a

measure based on quadratic gradients is bound to boost the noise variance, and also

makes the variance a function of the gradient level. This tends to further boost the

variance of significant feature points (prior to normalisation), although it is difficult

to assess what effect the subsequent normalisation then has. Perhaps when using

feature AAMs it might be better to modify the AAM objective function to min-

imise the residual-variance normalised sum of squares, in order to compensate for the

variance-boosting effect of the feature extraction. But in difficult cases (i.e. severe

fractures) where the initialisation is not as good, it appears that the corner-edge

feature-extraction process may give a larger convergence zone. We have refrained

from combining both gradient and corner edge feature sampling, because these more

complex AAMs have a substantially longer compute time for both search and espe-

cially AAM training. In the absence of evidence that they offer much real gain over

simpler gradient profile models, we have continued with the latter.

The dynamic ordering method is significantly better. Results cannot be directly

compared using standard parametric tests such as the t-test, because multiple repli-

cations are not independent, and the errors are far from being normally distributed

due to long error tails caused by search failure. However for matched samples (same

image set and initialisations) we can perform hierarchical bootstrap resampling on

the differences between corresponding points, as demonstrated by Scott [122]. This

allows bootstrapped confidence intervals to be constructed for the mean difference

between two methods. If for example such a 98% confidence interval does not span



| Sequence form | Vertebra status | Mean (mm) | Median (mm) | 95%-ile (mm) | %ge errors >2mm | %ge edges misaligned |
|---|---|---|---|---|---|---|
| Dynamic | Normal | 0.95 | 0.58 | 1.86 | 8.8% | 1.7% |
| Dynamic | Fractured | 1.80 | 0.84 | 5.14 | 25.0% | 12.8% |
| Dynamic | Grade 1 | 1.26 | 0.70 | 3.01 | 16.3% | 8.1% |
| Dynamic | Grade 2 | 1.53 | 0.77 | 3.89 | 20.9% | 10.0% |
| Dynamic | Grade 3 | 2.79 | 1.33 | 7.78 | 40.7% | 30.0% |
| Static | Normal | 1.17 | 0.62 | 2.58 | 13.0% | 3.1% |
| Static | Fractured | 2.12 | 0.93 | 6.08 | 29.0% | 14.8% |
| Static | Grade 1 | 1.41 | 0.73 | 3.62 | 18.8% | 10.0% |
| Static | Grade 2 | 1.98 | 0.85 | 5.40 | 26.3% | 12.3% |
| Static | Grade 3 | 3.15 | 1.60 | 8.10 | 44.6% | 32.8% |

Table 6.1: Search error statistics (point-to-line) for 6mm profile gradient AAM






Grade 2 1.99 0.92 5.24 27.0% 13.8%
Grade 3 3.14 1.71 7.97 45.9% 32.9%

Table 6.2: Search error statistics (point-to-line) for 6mm profile intensity AAM








Grade 2 2.03 0.96 5.14 27.1% 13.9%
Grade 3 3.17 1.71 8.03 45.9% 34.2%

Table 6.3: Search error statistics (point-to-line) for 6mm profile renormalised intensity AAM






Grade 2 2.18 1.04 5.63 31.6% 15.6%
Grade 3 3.32 1.91 8.18 48.8% 37.5%

Table 6.4: Search error statistics (point-to-line) for classical region intensity AAM








Grade 2 2.11 1.03 5.32 30.4% 15.4%
Grade 3 3.28 1.88 7.99 48.3% 37.6%

Table 6.5: Search error statistics (point-to-line) for classical region intensity renormalised AAM






Grade 2 2.02 0.92 5.22 27.7% 14.3%
Grade 3 3.09 1.58 7.82 44.7% 34.4%

Table 6.6: Search error statistics (point-to-line) for classical region intensity sigmoidal 2D gradient AAM








Grade 2 1.98 1.26 4.58 33.8% 15.6%
Grade 3 2.36 1.44 5.97 39.0% 28.5%

Table 6.7: Search error statistics (point-to-line) for region corner feature AAM

zero, then this can be interpreted as implying a significant difference at the 2% level.

Further details on this adaptation of hierarchical bootstrap confidence intervals are

given in [122].
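To make the idea concrete, a two-level bootstrap on matched differences could look like the sketch below. This is an assumption-laden illustration and not the procedure of [122]: it resamples images, then replications within each chosen image, and reads a percentile interval off the sorted bootstrap means. The function name, the data layout and the choice of a percentile interval are all hypothetical.

```python
import random
import statistics


def hierarchical_bootstrap_ci(diffs_by_image, n_boot=2000, level=0.98, seed=0):
    """Percentile confidence interval for the mean paired difference.

    diffs_by_image[i][r] is a list of per-point error differences
    (method A minus method B) for image i and replication r.
    """
    rng = random.Random(seed)
    n_images = len(diffs_by_image)
    boot_means = []
    for _ in range(n_boot):
        pooled = []
        # Level 1: resample images with replacement.
        for _ in range(n_images):
            reps = diffs_by_image[rng.randrange(n_images)]
            # Level 2: resample replications within the chosen image.
            for _ in range(len(reps)):
                pooled.extend(reps[rng.randrange(len(reps))])
        boot_means.append(statistics.fmean(pooled))
    boot_means.sort()
    # Approximate order statistics for the symmetric (in probability) interval.
    alpha = (1.0 - level) / 2.0
    lo = boot_means[int(alpha * (n_boot - 1))]
    hi = boot_means[int((1.0 - alpha) * (n_boot - 1))]
    return lo, hi
```

If the returned interval excludes zero, the difference would be read as significant at the 2% level, which is how the intervals in this section are interpreted.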

The symmetric (in probability) bootstrapped confidence interval for the difference be-

tween static and dynamic fitting sequences for gradient profile models is [0.155,0.279].

This implies a significant difference at the 2% level, as zero is not spanned by the

interval. Similarly for a triangulated regional AAM with renormalised gradient sam-

pling, the corresponding confidence interval is [0.138,0.306], so again the dynamic

sequencing results in a statistically significant improvement at the 2% level.

Comparing now a classical regional AAM with renormalised 2-D gradient to the pro-

file AAM (with renormalised 1-D gradient), we obtain a bootstrapped 98% confidence

interval for the mean difference between the two AAMs (both with dynamic sequenc-

ing) of [0.149,0.259]. The corresponding confidence interval is [0.150,0.292] for the

difference between a regional feature AAM (corner-edge) and this same profile AAM.

Therefore we conclude that profile AAMs outperform triangulated region AAMs in

this application, probably because the convergence zone is increased by using a profile

which samples outside the current shape, whilst there is little useful internal infor-

mation within the inside of the vertebra. However there is some suggestion that the

corner-edge feature AAM may perform better on severe fractures. Later we find other

means of improving the performance of the profile AAM in the severe fracture cases.



6.5.3 Initialisation Method and AAM profile length

Having selected profile gradient models as the AAM form to use, we then compared

the accuracy results for 3-point and 10-point initialisation methods. Only the dy-

namic sequencing method is used in these later experiments. Comparing the results

for 10-point initialisation given in table 6.8 with those already given for 3-point ini-

tialisation in table 6.1, it can be seen that using the full 10 points can improve the

accuracy, particularly for the fractured vertebrae. The 98% bootstrapped confidence

interval for the difference in mean error between 3-point and 10-point initialisation

methods is [0.149,0.281]. For fractured vertebrae only, the confidence interval is

[0.514,0.829]. The difference is therefore statistically significant, especially for frac-

tured vertebrae. Therefore the 10-point initialisation method is used subsequently.

We also evaluated several sampling profile lengths, and refined the sampling step size

by scaling relative to the mean vertebral size. We continued with a nominal 1mm for

the lower lumbar vertebrae (L2-L4), but other vertebrae have the step length reduced

in approximately the ratio of their mean mid-point height to that of L2. This results

in a nominal sampling length of 0.75mm at T7. We varied the semi-profile length in

steps of 2, using 6, 8 and 10 steps. Tables 6.8 to 6.10 give the results. There

is a small but significant reduction in mean error in moving from a profile length of

6 to 8 steps - the 98% confidence interval for the reduction is [0.003,0.033] overall,

and [0.051,0.362] on severely fractured vertebrae. There is no statistically significant

difference (at the 2% level) in overall mean error between an 8 step and a 10 step

profile. Also using 10 steps tends to mean that once a vertebra is a grade 2 fracture

or above, the inner samples from opposite sides (inferior and superior edges) tend to

intersect, so we start to introduce spurious redundancy into the model.

We selected 8 steps as the profile length to use.
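The scaling of the sampling step with vertebral size can be sketched as below. The heights used in the example are hypothetical, and capping the step at the nominal value (so that L3 and L4 keep the nominal 1mm) is an interpretation of the text rather than a stated rule.

```python
def sampling_step_mm(mean_height_mm, l2_height_mm, nominal_mm=1.0):
    """Profile step scaled by mean mid-point height relative to L2.

    Assumption: the step is never increased above the nominal value.
    """
    return nominal_mm * min(1.0, mean_height_mm / l2_height_mm)


def semi_profile_length_mm(step_mm, n_steps=8):
    """Physical reach of the profile on one side of the shape boundary."""
    return step_mm * n_steps


# Hypothetical mean heights: T7 at 21mm against L2 at 28mm gives a ratio of 0.75.
t7_step = sampling_step_mm(21.0, 28.0)
t7_reach = semi_profile_length_mm(t7_step)
```

Under these illustrative heights the selected 8 step profile reaches 8mm either side of the boundary at L2 but only 6mm at T7, which keeps the sampled region in proportion to the vertebra.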

6.5.4 Point constraint form

As discussed in section 5.3.5 in chapter 5, we had originally treated point constraints

as independent, i.e. the matrix SX in equation 4.31 is diagonal. We now compare

this method to using a full covariance matrix derived as described in 5.3.5 from a

bootstrapped estimate of the overlap point prediction errors. Results for the same

set of miss-8-out experiments are presented in table 6.11.



| Vertebra status | Mean (mm) | Median (mm) | 95%-ile (mm) | %ge errors >2mm | %ge edges misaligned |
|---|---|---|---|---|---|
| Normal | 0.74 | 0.56 | 1.58 | 5.1% | 0.3% |
| Fractured | 1.15 | 0.70 | 2.63 | 14.8% | 5.9% |
| Grade 1 | 0.85 | 0.60 | 1.78 | 8.1% | 2.8% |
| Grade 2 | 1.01 | 0.66 | 2.25 | 12.3% | 4.6% |
| Grade 3 | 1.69 | 0.93 | 4.57 | 26.2% | 15.1% |

Table 6.8: Search error statistics (point-to-line) for 6 step profile gradient AAM

| Vertebra status | Mean (mm) | Median (mm) | 95%-ile (mm) | %ge errors >2mm | %ge edges misaligned |
|---|---|---|---|---|---|
| Normal | 0.72 | 0.55 | 1.54 | 4.7% | 0.2% |
| Fractured | 1.08 | 0.68 | 2.41 | 13.4% | 5.2% |
| Grade 1 | 0.80 | 0.58 | 1.73 | 6.8% | 3.2% |
| Grade 2 | 1.00 | 0.67 | 2.26 | 12.4% | 4.3% |
| Grade 3 | 1.53 | 0.84 | 4.12 | 22.5% | 12.9% |

Table 6.9: Search error statistics (point-to-line) for 8 step profile gradient AAM




| Vertebra status | Mean (mm) | Median (mm) | 95%-ile (mm) | %ge errors >2mm | %ge edges misaligned |
|---|---|---|---|---|---|
| Normal | 0.77 | 0.56 | 1.61 | 5.8% | 0.5% |
| Fractured | 1.15 | 0.71 | 2.68 | 15.4% | 6.8% |
| Grade 1 | 0.84 | 0.61 | 1.76 | 7.6% | 3.9% |
| Grade 2 | 1.02 | 0.70 | 2.39 | 13.0% | 5.2% |
| Grade 3 | 1.71 | 0.93 | 4.72 | 27.5% | 16.3% |

Table 6.11: Search error statistics using full covariance matrix for point constraints



Comparing to the results in table 6.9 it appears that there is no advantage in using the

more complex model of point covariances. There is a small deterioration in mean

accuracy of around 0.05mm arising from using the more complex non-diagonal form

of SX , rising to 0.2mm on grade 3 fractures. Of course even this non-diagonal form of

SX is somewhat ad hoc, as the true error covariance is unknown, and has a circular

dependence on the constraint form assumed. One problem may be that even with

a non-diagonal form of SX , the AAM formulation assumes that the residual errors

are themselves independent, so equation 4.31 has a diagonal form of texture error

covariance matrix. But, as we noted in section 5.3.6 in chapter 5, the actual distribution

of the normalised residual sum of squares is far from being χ² distributed.

There is a large departure in both mean and variance from that expected. We speculated

that under-training and deficiencies in the AAM update matrix R will introduce

spatial correlation between nearby residuals. Perhaps there is no point in using more

complex models for SX , whilst using a simple diagonal form of the texture residuals’

Mahalanobis distance. Also as the actual residual variance is underestimated, there

is always a problem in using the correct commensurate scale between texture error

and point spatial constraints.

It seems that given the more fundamental underlying problem of how to formulate

an AAM objective function given the unknown residual distribution, the simpler

diagonal constraint form of equation 4.33 works well, as to some extent the errors in

both diagonal assumptions cancel each other out.

In other applications which use our sub-model combination approach it might be

more worthwhile to use a full non-diagonal point covariance matrix in the manner we

have developed, especially if there is less (or no) overlap between the sub-models, and

the constraints arise only from predictions through the global shape model. However

in such cases some further rescaling of the texture error residual sum of squares, such

as we introduced in section 5.3.6, should be trialled, in order to partially compensate

for ignoring “model noise” and residual spatial correlation.

It might also be possible to slightly improve our current results by attempting to

optimise the constraint weights used, especially as we know that the assumed residual

variances will be incorrect. However there is also the danger of over-tuning parameters

to a particular dataset, so we have not attempted to do this. We have continued to

use the initial point constraint weights of [113] (see also section 6.5.2.3).



6.5.5 Optimisation of the sub-model structure

Having determined a reasonably good form of triplet profile model with 10-point ini-

tialisation, we then experimented with using this model form with other sub-model

structures. We evaluated sub-models using single vertebrae, a central vertebra plus

neighbouring semi-vertebrae (semi-triplets), vertebral triplets as before, quintets (five

vertebrae), and once again a full global model. Note that when using quintet struc-

tures we continued to use a triplet at both extremes of the spine (T7-T9 and L2-L4).

Tables 6.12 to 6.14 present the point-to-line error statistics for the three alternative

evaluated sub-model structures, and table 6.15 shows similar results for using a single

global AAM. Note that the comparable results for triplet sub-models have already

been given in table 6.9. Single vertebrae sub-models do not appear to be reliable,

especially with fractured vertebrae. One problem with them is the rather high pro-

portion of cases in which a vertebral edge is confused with the converse edge of a

neighbour. It seems that not including any information about the neighbours in the

sub-model does not allow the search process to distinguish between the correct edge,

and a local minimum on the converse edge of a neighbour. Once we reach the semi-

triplet structure, there is a substantial increase in accuracy, especially for fractured

vertebrae. By including some information about neighbouring vertebrae the models

and the search process are better constrained, and better able to distinguish between

the correct vertebrae and erroneous solutions on neighbours. There is a further small

improvement moving up to triplets, and again with the quintets. In the latter case

the main advantage appears to be with the severe (grade 3) fractures, although the

relatively small numbers of vertebrae involved means it is difficult to be sure. The

98% confidence interval for the reduction in overall mean error moving from triplet to

quintet structures is [-0.003,0.027]. So the overall difference is not significant at the

2% level, but if we look at the difference only for fractured vertebrae the confidence

interval changes to [0.001,0.103], and so there appears to be a marginally significant

improvement.

If we continue to the largest possible structure and compare the better-performing

sub-model structures with the global AAM, we see that use of a single global AAM

gives substantially poorer results for fractured vertebrae. Comparing results for nor-

mals, the global model is a little worse than triplets or quintets, but this difference

is much smaller than we found with earlier results on a smaller (78 vertebrae) train-

ing set [113], for which the mean error on normal vertebrae reduced from 1.28mm



(global AAM) to 0.88mm (triplets). It appears that when the training set is suf-

ficiently expanded much of the advantage of the sub-model approach disappears in

largely “normal” cases. However even with a much-expanded training set the global

AAM is not able to cope with fractures. Even on grade 1 fractures we obtain rather

poor accuracy. This is not just a matter of shape-model undertraining. Tables 6.16

to 6.18 compare the underlying shape model re-fitting accuracies for triplets, quin-

tets, and a global shape model. These are obtained by randomly selecting 16 images

as a test set, and then training shape and appearance models on the remaining data.

The shape models are then refitted to the manually segmented points ¶, and residual

point-to-line errors calculated. The global shape model is less accurate, with around

double the mean errors of the triplet sub-models. There is a small deterioration in

the intrinsic shape model accuracy from triplets to quintets, but this is not reflected

in the AAM search errors, which are dominated instead by errors in locating the

correct solution.

Although the underlying global shape model is less accurate, it is clear that the global

AAM fitting errors are larger than can be explained by the shape model deficiencies

alone. For example the mean segmentation error with grade 2 fractures is 1.4mm,

compared to 0.7mm in the intrinsic shape model accuracy. With the global model

there are 18.5% of points with segmentation errors beyond 2mm, whereas only 4.9%

of points would have this degree of error due to shape model undertraining alone.

It may be that the global AAM search is fundamentally biased towards the prior

of the shape models - i.e. the mean (largely normal) shapes - and cannot cope

well with pathologies. This may partly be due to inadequacies in the AAM update

matrix, and partly due to other difficulties with global texture, such as the problem of

normalisation across an image with varying brightness/contrast‖. For example Figure

6.9 shows an image with a huge difference in brightness between the thoracic and

lumbar vertebrae. The thoracic vertebrae are invisible on the original image, though

can be just seen with locally enhanced contrast. Using sub-models makes this kind

of contrast adjustment more adaptable, and means that the sampled renormalised

structure is more likely to fit a linear model.

¶strictly speaking the appearance model constraints are applied in addition to the shape constraints

‖see also Figure 4.4 which shows that a major mode of the global APM is brightness variation across the image



Figure 6.9: There is a large variation in brightness across the original (leftmost) image. Local contrast enhancement (right) is necessary to reveal the mid-thoracic vertebrae.


Table 6.12: Search error statistics (point-to-line) for single vertebra sub-models




Table 6.13: Search error statistics (point-to-line) for semi-triplet sub-models


Table 6.14: Search error statistics (point-to-line) for quintet sub-models


Table 6.15: Search error statistics (point-to-line) for single global model

| Vertebra status | Mean (mm) | Median (mm) | 95%-ile (mm) | %ge errors >1mm | %ge errors >2mm |
|---|---|---|---|---|---|
| Normal | 0.20 | 0.15 | 0.56 | 0.4% | 0.0% |
| Grade 1 | 0.24 | 0.18 | 0.67 | 0.9% | 0.0% |
| Grade 2 | 0.34 | 0.24 | 0.96 | 4.4% | 0.5% |
| Grade 3 | 0.49 | 0.34 | 1.39 | 12.3% | 1.1% |

Table 6.16: Shape Model Intrinsic Accuracy for Triplet sub-models




Table 6.17: Shape Model Intrinsic Accuracy for Quintet sub-models


Table 6.18: Shape Model intrinsic accuracy for a single global model

Figure 6.10: Mean point-to-line errors (mm) by vertebral fracture grade, comparing quintet sub-model AAMs to a global AAM. The associated bootstrapped 98% confidence intervals (quintets first, global second) are as follows. Normal vertebrae: [0.693,0.737] vs [0.777,0.834]; grade 1: [0.690,0.836] vs [0.910,1.46]; grade 2: [0.875,1.15] vs [1.17,1.71]; grade 3: [1.17,1.63] vs [1.79,2.56].



6.5.6 Multiple Initialisations for Fractured Vertebrae

We have already noted that an AAM may fail to converge on a severely fractured

vertebra, because it instead locates a local minimum by fitting to the converse edge

of one or more neighbours. Although this occurs less with the best AAM form, quintet

sub-models, and 10-point initialisation, there remains the possibility of edge confusion.

These local optima may be more likely when a severe fracture has collapsed the central

vertebra so far from the AAM’s initial solution that the neighbouring edges are more

likely to be within the starting solution’s sampling profile. Another possibility is that

the true solution is not reachable using the AAM update matrix because it is too

far from the displacement set that lies in the linear region validly modelled by the

trained Jacobian. We therefore investigated the use of two alternative initialisations,

which were chosen to represent moderately and severely fractured vertebrae.

To do this, we extended the sub-model combination method by adding two alternative

starting solutions for each sub-model in the (size Nc) candidate list. At each iteration

two alternative fractured variants are generated (see next paragraph) for the central

vertebra in each candidate sub-model. Each of these 3Nc AAM searches is now

tentatively run. The best fitting of all these 3Nc searches is selected as the solution

to impose at this iteration, and its submodel is removed from the list of current

candidates. In other respects the method proceeds as in chapter 5, section 5.3.4. So

having picked the best of the 3Nc candidates, the points of the central vertebra of the

sub-model are saved in an overall solution vector, as are points in either neighbour

which have not been previously determined. Likewise the central points now have

high constraint weights attached, to ensure consistency for subsequent searches with

neighbouring sub-models. Finally a new candidate sub-model is now added as before

(if possible), and this completes an iteration of the fitting sequence. As noted above

in section 6.5.1, we use Nc = 5, so in this variation we run 15 AAM searches during

the initial iterations. Obviously there is a penalty in computation time. On an

Intel Centrino laptop the computation time increases from around 20 seconds to one

minute, though this could be reduced on multi-core processors by running parallel

execution threads.
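The selection step of one iteration, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: `make_variants` and `run_search` are hypothetical placeholders for generating the three starting solutions of a sub-model and for running one AAM search, and the fit quality is assumed to be a single scalar where lower is better.

```python
from typing import Any, Callable, List, Sequence, Tuple


def pick_best_search(
    candidates: Sequence[Any],
    make_variants: Callable[[Any], List[Any]],
    run_search: Callable[[Any, Any], Tuple[float, Any]],
) -> Tuple[float, Any, Any]:
    """Tentatively run every (candidate, initialisation) pair and keep the best.

    make_variants(sub) is assumed to return the current starting solution plus
    the two fractured variants, giving 3 starts per candidate (3*Nc searches).
    run_search(sub, init) is assumed to return (fit_error, solution).
    """
    best = None
    for sub in candidates:
        for init in make_variants(sub):
            fit_error, solution = run_search(sub, init)
            if best is None or fit_error < best[0]:
                best = (fit_error, sub, solution)
    if best is None:
        raise ValueError("candidate list is empty")
    return best


# Toy usage with Nc = 5 candidate sub-models (labelled 0..4).
calls = []


def toy_variants(sub):
    return [(sub, "current"), (sub, "grade2"), (sub, "grade3")]


def toy_search(sub, init):
    calls.append((sub, init))
    # Pretend that candidate 3, started from the grade 3 variant, fits best.
    err = 0.1 if (sub == 3 and init[1] == "grade3") else 1.0 + sub
    return err, {"sub": sub, "start": init[1]}


best_error, best_sub, best_solution = pick_best_search(
    range(5), toy_variants, toy_search
)
```

Because each tentative search is independent of the others, the inner loop is the natural place to introduce parallel execution.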

The alternative fractured initialisations of the central vertebra are generated thus.

Firstly a lower bound is calculated for the posterior height. The McCloskey prediction

method is used [95] (and see also 3.3.2 in chapter 3) to calculate a predicted posterior

height Hp(pred) from the current solutions for the 4 nearest neighbours. The lower



bound is then set to

$$ H_p^{(\mathrm{min})} = 0.75\, H_p^{(\mathrm{pred})} \tag{6.8} $$

Next the posterior height Hp of the sub-model’s central vertebra is reduced by 15%,

to allow for a potential misfit onto both neighbours in the worst case, and then

reduced by a further factor rp(g) depending on required fracture grade g, but subject

to the lower bound above.

So given the current posterior height Hp the new height is set to

$$ H_p' = \max\left(0.85\, r_p(g)\, H_p,\ H_p^{(\mathrm{min})}\right) \tag{6.9} $$

Then the mid-height and anterior height are reduced using standard factors rm(g) and

ra(g) respectively. So

$$ H_m' = r_m(g)\, H_p', \qquad H_a' = r_a(g)\, H_p' \tag{6.10} $$

We used the following reduction factors

$$ r_p(2) = 0.925, \qquad r_m(2) = 0.7, \qquad r_a(2) = 0.825 \tag{6.11} $$

$$ r_p(3) = 0.8, \qquad r_m(3) = 0.55, \qquad r_a(3) = 0.7 \tag{6.12} $$
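As a worked illustration of equations 6.8 to 6.12, the following sketch computes the three target heights for a fractured initialisation. The function name and the input heights (in mm) are hypothetical and chosen only to show both branches of the lower bound; the reduction factors are those given above.

```python
# Reduction factors from equations (6.11) and (6.12), keyed by fracture grade.
REDUCTION = {
    2: {"rp": 0.925, "rm": 0.7, "ra": 0.825},
    3: {"rp": 0.8, "rm": 0.55, "ra": 0.7},
}


def fractured_heights(hp, hp_pred, grade):
    """Return (Hp', Hm', Ha') for the alternative fractured initialisation.

    hp      : current posterior height of the central vertebra
    hp_pred : posterior height predicted from the neighbours
    grade   : 2 (moderate) or 3 (severe)
    """
    f = REDUCTION[grade]
    hp_min = 0.75 * hp_pred                      # lower bound, eq. (6.8)
    hp_new = max(0.85 * f["rp"] * hp, hp_min)    # eq. (6.9)
    hm_new = f["rm"] * hp_new                    # eq. (6.10)
    ha_new = f["ra"] * hp_new                    # eq. (6.10)
    return hp_new, hm_new, ha_new


# Illustrative inputs only (not measurements from the dataset).
grade2 = fractured_heights(hp=25.0, hp_pred=22.0, grade=2)  # bound inactive
grade3 = fractured_heights(hp=20.0, hp_pred=22.0, grade=3)  # bound active
```

In the second call the reduced height 0.85 x 0.8 x 20 = 13.6 falls below the bound 0.75 x 22 = 16.5, so the bound is used instead.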

Having calculated the desired new heights, the posterior, middle and anterior refer-

ence points are moved equidistantly along their current separation vectors to achieve

the required respective heights (as their new separation distance). This fixes the

corners and mid-points in the central vertebra (i.e. the standard 6 morphometric

points). The corners and mid-points in neighbouring vertebrae are temporarily fixed

at their current values, and the alternative solution (for the central vertebra of the

triplet) is then initialised to the best sub-model shape model fit to these (18) fixed

points. Note that the other vertebrae in the sub-model do not have any adjustment

made, although their standard reference points are temporarily set to be fixed for

the purpose of estimating the initialisation of the central vertebra. Note also that

the constraint weights associated with the alternative initialisation are always reset

to the low values of a prediction, even if the vertebra had already been provisionally



located as part of another sub-model’s search.

6.5.6.1 Results for multiple initialisations

The results of using the alternative initialisations are given in table 6.19, for both

triplet and quintet sub-structures. Interestingly when using the alternative fractured

initialisations the results for triplets and quintets are also much closer to each other

than when running with a single initialisation. There may be a small improvement,

particularly for the triplets. Any apparent improvement occurs mainly for the grade

3 fractures, with triplet sub-models. Here the mean error is reduced from 1.53mm

to 1.31mm and the percentage of edges misfit is reduced from 12.9% to 9.1%. The

improvement is less than that we reported earlier in [116], mainly because further

optimisation of the AAM (and perhaps also the inclusion of more fractured vertebrae

in the training set), had already improved the results for grade 3 fractures, even

without the alternative initialisation. The 98% confidence interval for the reduction

in mean error on grade 3 fractures is [0.061,0.322], and so this is significant at the

2% level. There is less apparent improvement when using quintets, though the mean

error for grade 3 fractures does reduce from 1.35mm to 1.28mm. The 98% confidence

interval for the mean reduction (on grade 3 fractures) is [-0.055,0.256], and so fails

to be statistically significant at the 2% level.
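A confidence interval of this kind can be obtained with a paired bootstrap, sketched below. The text does not specify the resampling unit or the number of resamples, so both are assumptions here: the sketch resamples per-item error differences (single initialisation minus multiple initialisations) and uses a percentile interval. The data are synthetic and are not results from this study.

```python
import numpy as np


def bootstrap_ci_mean_reduction(err_single, err_multi, n_boot=2000,
                                level=0.98, seed=0):
    """Percentile bootstrap CI for the mean of paired error reductions."""
    err_single = np.asarray(err_single, dtype=float)
    err_multi = np.asarray(err_multi, dtype=float)
    if err_single.shape != err_multi.shape:
        raise ValueError("paired error arrays must have the same shape")
    diff = err_single - err_multi          # positive means an improvement
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement, n_boot times.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    alpha = 1.0 - level
    lo, hi = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)


# Synthetic example: errors that improve by about 0.2 on average.
rng = np.random.default_rng(1)
err_multi = rng.gamma(2.0, 0.7, size=200)
err_single = err_multi + 0.2 + rng.normal(0.0, 0.1, size=200)
lo, hi = bootstrap_ci_mean_reduction(err_single, err_multi)
significant = lo > 0.0   # interval excludes zero at the chosen level
```

An interval that excludes zero corresponds to significance at the 2% level in the sense used above.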

We conclude that when using quintet structures the increase in computational effort

in using multiple initialisations is not worthwhile, as the method is already very

reliable with only a single initialisation.

6.5.7 Discussion

6.5.7.1 Summary of AAM optimisation

The dynamic ordering of the sub-model fit improves the robustness and accuracy of

the search. The main improvements are in the tails of the error distributions, and

in the more difficult cases such as fractured vertebrae and their neighbours. On the

vertebral dataset, profile AAMs perform better than the classical triangulated region

AAM, even when the more sophisticated feature-AAM of Scott et al [123] is used

for the latter. A profile length of around 8mm in the lower lumbar works well (and

then scaling relative to vertebral height for the thoracic vertebrae).

Sub-model   Vertebra               Search Error Statistic
Form        Status       Mean    Median    95%-ile    %ge errors    %ge edges
Triplets    Fractured    1.00    0.67      2.22       12.2%         4.1%
            Grade 1      0.79    0.58      1.72        6.7%         2.4%
            Grade 2      0.97    0.67      2.18       12.1%         3.9%
            Grade 3      1.31    0.77      3.25       18.8%         9.1%
            Normal       0.73    0.55      1.55        4.7%         0.1%
Quintets    Fractured    0.98    0.65      2.16       11.4%         3.9%
            Grade 1      0.76    0.59      1.60        5.1%         2.0%
            Grade 2      0.96    0.65      2.20       11.9%         4.3%
            Grade 3      1.28    0.78      3.00       18.3%         7.9%

Table 6.19: Search error statistics using alternative fractured initialisations

We optimised the sub-model structure used, finding that groups of 5 vertebrae (quin-

tets) were marginally superior, though the performance gain over triplets was small,

and largely confined to severe fractures. We found that single vertebra models were

unreliable. At the other extreme we confirmed that a single global model was also

unable to accurately segment fractured cases, though with normal vertebrae its per-

formance was more similar to the sub-model approach. This somewhat contradicted

our earlier result obtained on a smaller training set [113]. It appears that with a small

training set there is more gain to be obtained from decomposing the structure into

multiple sub-models, but this accuracy difference is gradually eroded as the training

set is extended; but only for normal cases. The global AAM still maintains too

great an a priori bias towards the mean shape to cope with pathologies or unusual

sub-shapes (see Figure 6.10). We therefore recommend our sub-model approach be

considered in other applications.

6.5.7.2 Overall final accuracy

The accuracy results in Table 6.14 demonstrate that the performance of the multi-

AAM segmentation on normal vertebrae (mean 0.71mm) is comparable with manual

precision (0.55mm population, 0.81mm osteoporotic patients [57]). “Normal” means

not fractured, but some images in the dataset included disc disease, which can lead



to narrowed disc spaces and the endplates of two adjacent vertebrae being closer

together, so that there is no clear edge to separate individual vertebrae. Other images

include large osteophytes, which can confuse the positions of the vertebral corners.

Despite some difficult images, over 95% of points in normal vertebrae are located to

within 2mm. The overall accuracy is better than other comparable cited figures in

the literature [126, 143, 18, 77]. For example de Bruijne et al [18] obtained a mean

point-to-line accuracy of 1.4mm on lumbar radiographs using shape particle filtering

- a stochastic search algorithm which uses a distribution of candidate shapes. Our

overall mean point-to-line error is half that at 0.7mm, even though DXA images are

typically noisier than conventional radiographs and have poorer spatial resolution;

and we also include the mid-thorax, which has more soft-tissue clutter and a greater

probability of fracture. On the other hand the method of de Bruijne et al [18] is

completely automatic, whereas our system uses an approximate manual initialisation

on the vertebral centres. There may be future scope in combining approaches and

using a stochastic search, followed by our AAM based method for fine accuracy. This

might be more appropriate for very large pharmaceutical trials or epidemiological

studies, whereas in a clinical setting an approximate manual initialisation by the

viewing clinician is reasonable when this takes only a few seconds.

A decomposition of the accuracy results by individual vertebrae is given in Table 6.20.

The accuracy is consistently good over most of the spine, but with some deterioration

in accuracy on the upper and lower vertebrae (L4 and T7). This is probably because

the extreme vertebrae tend to be noisier, and L4 can be partially obscured by the iliac

crest. The precision is excellent at around 0.1mm for L2-T9, deteriorating

to be in excess of 0.2mm at the extreme vertebrae (L4 and T7). This is still better than

typical manual precision.

The accuracy performance on fractured vertebrae is not as good as on non-fractured

vertebrae, but is still promising, with a median accuracy of 0.66mm, and mean error

1mm, which is comparable to manual precision. Fractured vertebrae present a more

challenging problem for an approach using statistical models, as the variation is much

greater for pathological cases than for normal vertebrae. As well as the greater vari-

ation in shape, the appearance is more complex, as in addition to the edge structure

provided by the endplate, there may also be a remnant of the outer ring of cortical

bone, and this can provide a secondary outer edge to confuse the search algorithm.

            Search Error Statistic
Vertebra    Mean Accuracy (mm)    Mean Precision (mm)
T7          0.88                  0.21
T8          0.75                  0.16
T9          0.72                  0.11
T10         0.70                  0.10
T11         0.74                  0.10
T12         0.80                  0.12
L1          0.81                  0.11
L2          0.77                  0.12
L3          0.83                  0.19
L4          0.95                  0.24

Table 6.20: Accuracy and Precision (mm) by individual vertebrae. The accuracy is the mean point-to-line error averaged over the vertebrae. The precision is calculated similarly but taking the mean of the 10 segmentations as the reference curve, and deducting one degree of freedom when calculating the mean.

Approximately 88% of points in fractured vertebrae are located to within 2mm. The

accuracy on fractured vertebrae deteriorates with fracture grade (Table 6.14, Figure

6.10), with the mean error increasing from 0.76mm in grade 1 fractures to

1.35mm in grade 3. The error distributions are generally skewed, as a number of

search failures produce long tails to the distributions. The skewed error distribution

increases the mean error to 1.35 mm for grade 3 fractures, whereas the median is

only 0.8mm. Around 20% of points in grade 3 fractures have an error beyond 2mm.

We had initially suspected that the tail of the error distribution was substantially

due to undertraining of the shape model on fractures. However, the shape model

intrinsic accuracy results (Table 6.17) indicate that, although this plays some part,

the undertraining alone is not sufficient to explain the size of the error tail. For

example only 0.5% of points in grade 2 fractured vertebrae cannot be fitted at all to

within 2mm because of shape model inadequacies, and even with grade 3 fractures

only 2.4% would fail on this basis alone. The shape model is adequate up to grade

2 fractures, and although in grade 3 fractures 14.5% of points cannot be fitted at all

to within 1mm, the mean fitting error of 0.54mm is still adequate. The tail of the

fitting error for the grade 3 fractures should reduce with more training examples.

If inadequate training of the shape model does not substantially account for the fitting

failures on grade 3 fractures, then the most likely explanation is that the AAM search

locates some form of local optimum. A possible erroneous local optimum is where

the top of the vertebra is fitted to the bottom of the neighbouring vertebra above, or

the converse (Figure 6.8). The number of such misaligned edges increases from 2.8%

with grade 1 fractures to 9.4% with grade 3 fractures (Table 6.14). We attempted to



avoid these misaligned solutions by searching from multiple initialisations (starting

from deliberately fractured shapes), but although there was a marginal reduction

down to 7.9% of misaligned edges, and the grade 3 mean dropped slightly to 1.28mm,

overall this very slight improvement is probably not worth the additional compute

time. If the somewhat less robust triplet structures, or a less highly optimised AAM

were used, then it would become more worthwhile. Ultimately there is a limit set by

the fact that grade 3 fractures have low BMD, which results in a poor signal strength.

It is likely that this inevitably makes them very difficult to segment accurately. In

fact the manually annotated “gold standard” has itself become somewhat tarnished

due to difficulties in visualising the exact location of the edge of the vertebral body

at poor signal-to-noise ratio.

The AAM segmentation provides a detailed shape outline using between 32 and

40 points (Figure 6.3), and so provides much more information than is obtained

in current six point morphometry. In [32] it was shown that the use of a computer-

assisted technique to position tangential lines which define marker placements and the

vertebral axis produced better precision in vertebral height measurements, because a

degree of subjectivity in the placement of points along the endplate boundaries was

removed, and a more consistent definition of the mid-vertebral axis was also obtained.

We anticipate that, for the same reasons, the use of a detailed shape model rather

than just six standard points would also allow better precision in extracted height

information, when this is required for standard morphometric techniques.

The use of AAM based methods in a clinical tool would be much quicker than man-

ual morphometric methods. The tool requires an approximate manual initialisation

(on the centre of each vertebra), but this takes only around 10 seconds per image

to perform, and need not be precise. The automatic algorithms then locate all the

vertebral endplates in around 30 seconds, rather than the minutes (typically 5-15 min-

utes) that would be required to perform manual 6-point placement. This algorithm

run-time could be improved on multi-core processors as the dynamically sequenced

search could be multi-threaded (each candidate AAM search is independent).

It is a limitation of the study that the more modern finer resolution (0.35mm) data was

not used to its full potential, as the presence of a substantial portion of older data led

us to construct appearance models appropriate to the worst case (1mm) resolution.

As we acquire more modern data we will be able to build finer scale models that would

be expected to yield better accuracy. We are also currently extending the modelling



up to T4, as the more modern DXA scanners tend to produce better images of the

upper thoracic vertebrae than in some of the older data used in the study.

The advantages of sub-model decomposition are greater with small training sets, and

with a large enough training set may become more marginal and apply mainly to

unusual examples (e.g. pathologies). However in medical applications when even

small changes in mis-classifications of diseases can be of importance to patients,

even a modest improvement in handling pathological cases can be of real benefit.

Furthermore the sub-model approach should offer a greater improvement in imaging

modes (e.g. radiographs) where varying projection or magnification across the image

means that there are advantages to locally varying the pose parameters.

6.5.8 Conclusions

We have assessed the location accuracy of multi-AAM segmentation for vertebral

assessment on DXA images, using a large training set containing a substantial number

of fractured vertebrae. The accuracy achieved on normal vertebrae is good and is

comparable to manual precision, whilst the automatic precision is substantially better

than by a manual method at under 0.2mm. The accuracy performance of the tool

does deteriorate with increasing fracture grade (Figure 6.10), but even in fractured

vertebrae, results of acceptable accuracy are achieved in almost 90% of cases. The

shape models are adequate up to grade 2 fractures, and for over 87% of points in even

grade 3 fractures. The feasibility of substantially automating vertebral morphometry

measurements on DXA images is confirmed, even with multiple fractures present.

However a small amount of manual correction would be necessary, mainly for the

more severe fractures.


Chapter 7

Vertebral Fracture Classification using Shape and Appearance Parameters

7.1 Introduction

It is known that current quantitative methods of detecting vertebral fractures based

on height ratios are insufficiently specific, particularly in distinguishing mild (grade

1) fractures from other kinds of short vertebral height deformities. This has led to

the Genant semi-quantitative method [65] becoming almost a de facto gold-standard

for fracture assessment, but there still remains significant subjectivity, particularly

for mild (grade 1) fractures. The subjectivity problem is discussed at length by Jiang

et al, who proposed an Algorithmically Based Qualitative diagnosis method (ABQ)

[79]. Further discussion has already been given in chapter 3. In addition to the issue

of subjectivity, another problem is the inadequate number of radiologists in some

countries to interpret radiographs. Furthermore, newer scanning technologies (DXA)

are becoming available in units other than radiology departments. Therefore it is

desirable to define a quantitative approach which can capture at least some of the

more subtle information used in expert visual assessment. Our aim is to define more

reliable quantitative fracture classification methods based on a complete definition

of the vertebra’s shape, and the texture within a sampling profile centred on the

endplate. Some similar work using shape parameters had been previously done by


Smyth [126] on a smaller dataset of lumbar vertebrae. Because in practice radiologists

do not classify only using shape, but also employ appearance clues involving

the texture around the endplate, and the presence or otherwise of other differen-

tial indicators (e.g. disc problems, osteophytes), we also investigated using the full

appearance model parameters. These encode both shape and texture information.

Linear classifiers were constructed using shape and appearance parameters and com-

pared with more standard morphometric methods.

We have already published much of the contents of this chapter in a journal paper

[118]. The results presented here are slightly different, due to some further optimisa-

tion of the classifiers, and also some subsequent changes in the dataset, though the

conclusions are not affected in substance.

7.2 Classification Methods

7.2.1 Data and Ground Truth

The same dataset was used as for the segmentation experiments of the previous

chapter, that is 360 DXA images. Initially we trained and tested the classifiers using

the manually segmented points (the “gold standard”). This was to explore classifier

performance unconfounded by AAM segmentation errors. Later in this chapter we

present results for classifiers still trained on the manual segmentations, but tested

against the automatically determined shapes (the AAM solutions of the previous

chapter).

The vertebrae were first independently classified by two radiologists (JEA,EP) us-

ing the Algorithmically Based Qualitative (ABQ) method [79, 54]. This method

emphasises the collapse of the endplate∗ as the fundamental visual indicator that a

deformed vertebra really is a fracture. This typically produces a more diffuse texture

around the endplate. In fractures it is often possible to also see the remnant of the

peripheral intact cortical rim of the vertebral body, which in lateral view appears as

a secondary edge above the endplate itself. This differs from the typically crisper

vertebral edge associated with (non-fracture) short vertebral height deformities (e.g.

∗most commonly the mid-portion of the endplate




mild wedging caused by spondylosis). The two radiologists then compared readings,

and where there was a discrepancy they reached a consensus judgment. If a vertebra

was judged as fractured, it was assigned a grade according to the Genant grading sys-

tem [65]. Vertebrae were classed as either: Normal, Deformed (but not fractured),

Fractured (with sub-grades in the Genant system), or Not Visualised. Note that

those vertebrae classed as deformed all display height loss in excess of 15% and

may be confused with mild fractures; often these displayed mild wedging due to disc

disease. The 79 vertebrae which could not be adequately visualised were ignored in

the later analysis. There were 354 fractures (97 mild, 147 moderate, and 110 severe)

and also 158 non-fracture vertebral deformities, and so there was considerable scope

for confusion between mild fractures and non-osteoporotic deformity.

Note that the “gold standard” data used here was slightly different to that previously

used in our earlier publication [118], as some of the spines in a subset of the data

had originally been incorrectly annotated by one vertebral level (usually L3 had

been marked as L4). This affected 53 images. These spines were shifted and the

supplementary vertebrae (usually T7) annotated and classified. In a few cases I

suspect that some confusion may have been introduced as to the vertebral levels

reported by the radiologist. This would lead to the results in [118] being slightly

worse than they should be. The affected spines were therefore completely re-classified

by the two radiologists and so the results reported here differ slightly from those

in [118] (overall performance is slightly better). However the conclusions are not

materially affected.

7.2.2 Linear Classifiers - Inputs and Training Scheme

Although vertebrae at different levels have subtly different shape distributions, there

was not enough data to have a separate classifier for each vertebral level. The data

was pooled into three shape models (and hence three classifiers) for: the lumbar

spine (L4 to L1), lower-thoracic spine (T12-T10), and mid-thoracic spine (T9 to

T7). For each of these sections a point distribution model (PDM) was derived as

follows. Firstly, the vertebral shapes were recursively aligned to the group mean

shape, using generalised Procrustes Analysis to remove translational, rotational and

isotropic scaling from the shape. Then the remaining variance around the mean shape

was modelled using principal components analysis (PCA) to extract the eigenvectors

of the covariance matrix associated with 98% of the remaining point position variance




(ordered by largest eigenvalue - i.e. highest variance component - first). This follows

the standard method for deriving a linear shape model as discussed in chapter 4, and

results in a substantial reduction in dimensionality. These eigenvectors then define a

basis for the space of allowable shapes, and the shape parameters of a vertebra are

simply the basis coefficients required to reconstruct its aligned shape. A particular

vertebral shape is defined by a vector of points x, which is projected into the shape

model space so:

$$ \mathbf{x} \approx T(\mathbf{x}_m + P_s \mathbf{b}), \tag{7.1} $$

where xm is the mean aligned shape (in the model frame), Ps is the shape model

orthogonal basis matrix consisting of the column-wise concatenation of the eigenvec-

tors, b is the vector of shape parameters, and T is a similarity transform representing

the shape’s pose parameters (positional offset, isotropic scaling and rotation).
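The PCA step and the projection of a shape onto the model can be sketched as below. This is a minimal sketch, assuming the shapes have already been Procrustes-aligned into the model frame (so the pose transform T is omitted); the function names are illustrative and the data are synthetic.

```python
import numpy as np


def build_shape_model(aligned_shapes, var_fraction=0.98):
    """PCA shape model retaining `var_fraction` of the variance.

    aligned_shapes: (n_shapes, 2 * n_points) array, one aligned shape per row.
    Returns the mean shape, the basis P (eigenvectors as columns, ordered by
    decreasing eigenvalue) and the retained eigenvalues.
    """
    X = np.asarray(aligned_shapes, dtype=float)
    mean = X.mean(axis=0)
    # The right singular vectors of the centred data are the eigenvectors
    # of the sample covariance matrix.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = s ** 2 / (X.shape[0] - 1)
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n_modes = min(int(np.searchsorted(cumulative, var_fraction)) + 1,
                  eigvals.size)
    return mean, vt[:n_modes].T, eigvals[:n_modes]


def shape_params(x_aligned, mean, P):
    """Shape parameters b: basis coefficients of the aligned shape."""
    return P.T @ (np.asarray(x_aligned, dtype=float) - mean)


def reconstruct(b, mean, P):
    """Model-frame shape mean + P b."""
    return mean + P @ b


# Synthetic shapes with two dominant modes of variation plus small noise.
rng = np.random.default_rng(0)
modes = np.linalg.qr(rng.normal(size=(12, 2)))[0]
coeffs = rng.normal(size=(50, 2)) * np.array([5.0, 2.0])
shapes = 10.0 + coeffs @ modes.T + rng.normal(scale=0.01, size=(50, 12))

mean, P, lam = build_shape_model(shapes)
b = shape_params(shapes[0], mean, P)
residual = np.abs(reconstruct(b, mean, P) - shapes[0]).max()
```

Because the columns of P are orthonormal, the projection is a plain matrix product and no matrix inversion is needed.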

Once the shape models were derived, the appropriate shape model was fitted to each

vertebra in the training set, and the resultant shape parameters were recorded. Note

that in the model frame, the eigenvectors are normalised, with absolute size implicit

in the pose parameters of T . Thus the shape parameters are scale-free, and so might

fail to determine crush fractures where there is simply an overall height reduction,

with little other change in shape. So in addition a scale measure was also used in the

classifier. This was taken from the McCloskey method [95], and was the ratio of the

actual posterior height Hp to the predicted height Hp(pred) given the posterior height

of four neighbours, with the predicted height calculated as in [95] (see also Chapter

3 for a summary of the McCloskey method). The expected ratios of vertebral heights

for the McCloskey predicted height were based on a trimmed population of normal

reference data for DXA images taken from [111]. We refer to the ratio of actual to

predicted posterior height below as the crush ratio with

$$ r_{\mathrm{crush}} = \frac{H_p}{H_p^{(\mathrm{pred})}} \tag{7.2} $$

We then trained a linear discriminant. This was done by assigning categorical values

of -1 to non-fractured vertebrae (normal or deformed) and +1 to fractured vertebrae.

We combined these into an overall category vector y. To handle mean offsets

in the regression model, let the shape parameters bi of image i be extended with the

crush ratio and an additional -1, to create the parameter vector

$$ \mathbf{v}_i = [\, r_{\mathrm{crush}},\ \mathbf{b}_i^T,\ -1 \,]^T $$

and denote the row-wise concatenation of the $\mathbf{v}_i^T$ by V. The regression model weights are then given by:

$$ \mathbf{w} = (V^T V)^{-1} V^T \mathbf{y} \tag{7.3} $$
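The least-squares fit of equation 7.3 can be sketched as follows. This is an illustration on synthetic data, not the training code used here: the helper names are hypothetical, and `numpy.linalg.lstsq` is used in place of forming the inverse explicitly, which gives the same weights when V has full column rank.

```python
import numpy as np


def design_matrix(r_crush, B):
    """Rows are v_i^T = [r_crush, b_i^T, -1]."""
    r = np.asarray(r_crush, dtype=float).reshape(-1, 1)
    B = np.asarray(B, dtype=float)
    return np.hstack([r, B, -np.ones((B.shape[0], 1))])


def train_linear_discriminant(V, y):
    """Least-squares weights, equivalent to (V^T V)^{-1} V^T y."""
    w, *_ = np.linalg.lstsq(V, np.asarray(y, dtype=float), rcond=None)
    return w


# Synthetic data: fractured vertebrae are given a lower crush ratio.
rng = np.random.default_rng(0)
n = 100
r_crush = np.concatenate([rng.normal(1.0, 0.03, n), rng.normal(0.7, 0.03, n)])
B = rng.normal(size=(2 * n, 3))                  # stand-in shape parameters
y = np.concatenate([-np.ones(n), np.ones(n)])    # -1 non-fractured, +1 fractured

V = design_matrix(r_crush, B)
w = train_linear_discriminant(V, y)
predicted = np.where(V @ w > 0.0, 1.0, -1.0)     # threshold the score at zero
```

Sweeping the decision threshold on the score V w, rather than fixing it at zero, is one way to trace out an ROC curve for such a classifier.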

An appearance-based classifier was also derived in a similar manner but using the

parameters of an appearance model [26] rather than simply shape parameters. In

summary the texture used in the appearance model was extracted by sampling along

normal profiles centred on each shape model point. After a non-linear renormalisa-

tion to the local image statistics (see next section), a texture model can be derived in

an essentially similar manner to the shape model, by performing PCA. Finally corre-

lation between shape and texture can be modelled by concatenating both shape and

texture model parameters into a combined vector and performing a tertiary PCA.

This produces the appearance model, whose parameters determine both shape and

surrounding texture. See chapter 4 for further details.

Our classifiers are simply binary (fractured or not). If a vertebra is subsequently

classified as fractured then its grade can be assigned by conventional methods of

height reduction thresholds.

For a comparison with what might be achieved in a similar way using the more

standard morphometric height ratios we also trained a linear discriminant using the

crush ratio, and the mid-height and wedge ratios:

$$ r_{\mathrm{mid}} = \frac{H_m}{H_p}, \qquad r_{\mathrm{wedge}} = \frac{H_a}{H_p}, \tag{7.4} $$

where Hm, Ha are the middle and anterior heights respectively. As a second baseline

comparison we used a hybrid morphometric method in the more standard manner,

using these same ratios: the Eastell-Melton method [45] for the mid-height and wedge

ratios, and the McCloskey method’s [95] crush ratio. We calculated the mean and

standard deviation of each of the three ratios over the normal population (excluding

any vertebrae classified as deformed). No sample trimming was performed, as we

considered that in effect this had already been done by the radiologists’ classification

of certain vertebrae as deformed. A vertebra was then classified as fractured if any

of the three ratios was more than X standard deviations below the normal mean. X

was varied to generate derived Receiver-Operating-Characteristic (ROC) curves. We




refer to this method below as Eastell-McCloskey hybrid.
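The thresholding rule of the hybrid method can be sketched as below. The function names and all numerical values are illustrative assumptions; in particular the small "normal" sample stands in for the normal reference population and is not data from this study.

```python
import numpy as np


def hybrid_is_fractured(ratios, normal_mean, normal_sd, x_sd):
    """Flag a vertebra if ANY ratio lies more than x_sd SDs below the mean.

    ratios: (n, 3) array with columns [crush, mid-height, wedge].
    """
    z = (np.asarray(ratios, dtype=float) - normal_mean) / normal_sd
    return (z < -x_sd).any(axis=1)


def roc_points(ratios, is_fracture, normal_mean, normal_sd, thresholds):
    """(false positive rate, true positive rate) for each threshold X."""
    is_fracture = np.asarray(is_fracture, dtype=bool)
    points = []
    for x_sd in thresholds:
        pred = hybrid_is_fractured(ratios, normal_mean, normal_sd, x_sd)
        tpr = pred[is_fracture].mean()
        fpr = pred[~is_fracture].mean()
        points.append((float(fpr), float(tpr)))
    return points


# Toy reference population of normal vertebrae: [crush, mid, wedge].
normals = np.array([
    [1.00, 0.95, 0.92],
    [0.98, 0.93, 0.94],
    [1.02, 0.96, 0.90],
    [0.99, 0.94, 0.93],
])
normal_mean = normals.mean(axis=0)
normal_sd = normals.std(axis=0, ddof=1)

# One biconcave-like vertebra (low mid-height ratio) and one normal one.
test_ratios = np.array([[0.97, 0.70, 0.91], [1.00, 0.94, 0.92]])
flags = hybrid_is_fractured(test_ratios, normal_mean, normal_sd, x_sd=3.0)
curve = roc_points(test_ratios, [True, False], normal_mean, normal_sd,
                   thresholds=[1.0, 3.0, 5.0])
```

Raising X makes the rule stricter, so both the true and false positive rates can only fall or stay the same as X increases.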

7.2.2.1 Height Calculation

In order to calculate vertebral heights it is necessary to define a vertebral axis line.

We use the method of Felsenberg and Kalender [53]. This also helps correct for lateral

displacement of the morphometric points, as only separation distance perpendicular

to this axis is used. The axis is defined as the angular bisector of the two (superior and

inferior) lines joining the respective posterior/anterior corner points. More precisely,

two lines L1,L2 are constructed: L1 is the line joining the superior posterior to

superior anterior points; and the inferior posterior to anterior points defines L2. Then

we calculate the point xI where these two lines intersect; or if a parallel check is

satisfied the algorithm takes the simpler case of the parallel line in the middle of

the (parallel) line pair. Assuming the more standard former case, we next calculate

the line LV passing through xI which bisects the angle between L1,L2. This defines

the vertebra’s axis. It is actually the normal L′V to this axis which defines the line

along which heights are measured. Thus for example, with the obvious notation

for superior and inferior posterior coordinates, and having normalised L′V to have a

length of unity:

$$ H_p = \left| (\mathbf{x}_{sp} - \mathbf{x}_{ip}) \cdot L'_V \right| \tag{7.5} $$

And likewise for the other point pairs, mutatis mutandis.
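The height calculation can be sketched as follows. This is a minimal sketch under stated assumptions: the point labels (`sp`, `sa`, `ip`, `ia`, `sm`, `im`) are our own shorthand, and because only the direction of the axis affects the heights, the intersection point xI is not computed. Summing the two unit endplate directions gives the bisector direction, and reduces to the common direction when the lines are parallel.

```python
import numpy as np


def _unit(v):
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("degenerate direction vector")
    return v / norm


def height_direction(sp, sa, ip, ia):
    """Unit normal L'_V to the vertebral axis, along which heights are measured."""
    d1 = _unit(sa - sp)      # direction of L1 (superior posterior -> anterior)
    d2 = _unit(ia - ip)      # direction of L2 (inferior posterior -> anterior)
    axis = _unit(d1 + d2)    # angular bisector direction of L1 and L2
    return np.array([-axis[1], axis[0]])


def vertebral_heights(points):
    """Return (Hp, Hm, Ha) from the six morphometric points, as in eq. (7.5)."""
    p = {k: np.asarray(v, dtype=float) for k, v in points.items()}
    n = height_direction(p["sp"], p["sa"], p["ip"], p["ia"])
    hp = abs(np.dot(p["sp"] - p["ip"], n))
    hm = abs(np.dot(p["sm"] - p["im"], n))
    ha = abs(np.dot(p["sa"] - p["ia"], n))
    return float(hp), float(hm), float(ha)


# Toy vertebra (mm): rectangular body with slightly concave endplates.
box = {
    "sp": (0.0, 10.0), "sm": (10.0, 9.0), "sa": (20.0, 10.0),
    "ip": (0.0, 0.0), "im": (10.0, 1.0), "ia": (20.0, 0.0),
}
heights = vertebral_heights(box)

# The same vertebra with the superior anterior corner lowered (wedging).
wedge = dict(box, sa=(20.0, 8.0))
wedge_heights = vertebral_heights(wedge)
```

Projecting onto the normal rather than taking the raw point-to-point distance is what discounts lateral displacement of the markers along the endplates.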

7.2.2.2 Appearance Model Form

The models used for classification were similar profile models to the ones previously

used for segmentation. Some pre-smoothing of the image was performed, and the

non-linear renormalisation was also slightly different. We experimented with using

simply the image texture (but renormalised), and also gradient along the profile.

Sampling profiles were defined normal to the vertebral shape tangent at each point

in the shape model, with a nominal step length depending on the vertebral type

(lumbar, lower-thoracic, mid-thoracic): 1mm for the lumbar vertebrae, 0.9mm

for the lower-thoracic vertebrae (T12-T10), and 0.75mm for the mid-thoracic vertebrae

(T9-T7). The local image patch was pre-smoothed with a 5-tap radially symmetric

Gaussian filter with a standard deviation of 0.7 times the step length. These sizes

are strictly applied to the mean shape in the class, and the actual step length is

renormalised based on the ratio of vertebral size to mean size, where the size metric

is the root mean square distance from the boundary to the centroid. For the gradient

APMs, the gradient along the profile was then extracted using a 1x3 Sobel filter,

introducing some additional smoothing across the profile. For texture models a similar

smoothing across the profile was introduced by sampling one step either side of the

current position and convolving with the Sobel weights [0.25, 0.5, 0.25]^T. The profile

lengths were 5 steps inside the vertebra and 8 steps outside. A slightly larger sample

was taken outside because the shape follows the collapsed endplate in the case of a

fracture, and the larger external sampling distance ensures that where the endplate

has collapsed by up to approximately 40% in height, but a cortical rim remains

above, the latter can be sampled by the outer profile. The ensemble of sampled

vectors was then renormalised to the local image statistics using a mapping based

on the signed square root of the Geman-McClure robust error function [61], which

applies a standard normalisation to (-1,1) whilst tending to truncate any extreme

values. See equation 6.2 in chapter 6. The normal use of the Geman-McClure

function is as an M-estimator in robust statistics, but we have used it here as an

extension of the sigmoidal normaliser used by Cootes et al ([32], see also equation

4.27). The advantage of the Geman-McClure form over that of Cootes et al is that

scale is explicitly incorporated, and the use of median statistics helps avoid introducing

too much “signal” into the estimate of background. Thus significant structure should

be towards the end of the function range, whereas background and noise should be

mostly confined to the [-0.5,0.5] range. This normaliser devotes a greater proportion

of the overall range to significant structure than that proposed by Cootes et al [32].

Also it has a more natural extension to pure grey-level AAMs, rather than gradient

or other feature detector AAMs with zero background expectation.
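A minimal sketch of such a normaliser follows. Equation 6.2 is not reproduced in this section, so the exact form used here is an assumption: taking the Geman-McClure function as ρ(r) = r²/(r² + s²), its signed square root is r/√(r² + s²), which maps any input into (-1,1). The choice of the median as the centre and a scaled median absolute deviation as the scale s is likewise an assumption made for illustration, in the spirit of the median statistics mentioned above.

```python
import math
from statistics import median

def geman_mcclure_normalise(samples, centre=None, scale=None):
    """Map samples into (-1, 1) using the signed square root of r^2 / (r^2 + s^2)."""
    if centre is None:
        centre = median(samples)  # assumption: robust centre from the median
    residuals = [x - centre for x in samples]
    if scale is None:
        # Assumption: robust scale from the median absolute deviation (MAD),
        # multiplied by 1.4826 so that it matches sigma for Gaussian data
        scale = 1.4826 * median(abs(r) for r in residuals)
    if scale == 0:
        return [0.0 for _ in residuals]
    # Small residuals map almost linearly; extreme values saturate towards +/-1
    return [r / math.sqrt(r * r + scale * scale) for r in residuals]
```

For gradient profiles, where the background expectation is zero, one would pass `centre=0.0` rather than use the median.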

When concatenating the shape and texture parameters to produce the vectors used

to train the appearance models the weighting scheme was to equalise the variance of

the shape and texture sub-components.

7.3 Experiments

7.3.1 Initial APM form selection

When training classifiers it is generally necessary to select a subset of relevant features.

Often leave-one-out jackknife tests can be performed to select relevant features (e.g. by

greedily introducing a new feature that most improves the ROC curve area). However

in our case the features are shape and appearance parameters which themselves

depend on the training subset, and so it is not possible to do this. We experimented

instead with performing feature selection by means of a stepwise regression process.

We start with a forwards stepwise process which only selects shape or appearance

parameters that significantly improve the categorical regression’s residual sum-of-

squares. At each iteration the most significant feature variable (on an F-ratio) is

selected until no more significant variables remain. We also run a backwards process

that starts with all parameters included, and gradually removes variables that do

not make the residual sum of squares significantly worse. At each iteration the

variable with minimum associated F-ratio is removed. Next we take the intersection

of variables from both forwards and backwards processes, and run another forwards

process; similarly we take the union of both and run a secondary backwards process.

The final variable set is the union of both of these parameter sets.
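The way the four stepwise passes are combined can be sketched as follows. The callables `forward` and `backward` are hypothetical placeholders for the F-ratio based forwards and backwards selection processes described above (their implementation is not shown), and the assumption that the secondary forwards pass may add any feature to the intersection is an interpretation of the text rather than a documented detail.

```python
def combine_stepwise(features, forward, backward):
    """Combine forwards and backwards stepwise selections into a final feature set.

    forward(start, candidates) -> set: starts from `start` and adds significant
        features drawn from `candidates` (hypothetical callable).
    backward(start) -> set: starts from `start` and removes features whose
        removal is not significant (hypothetical callable).
    """
    features = frozenset(features)
    fwd = set(forward(frozenset(), features))   # primary forwards pass from the empty set
    bwd = set(backward(features))               # primary backwards pass from the full set
    second_fwd = set(forward(frozenset(fwd & bwd), features))  # forwards from the intersection
    second_bwd = set(backward(frozenset(fwd | bwd)))           # backwards from the union
    return second_fwd | second_bwd              # final set: union of the secondary passes
```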

A closely related question is what proportion of the total variance of both shape and

texture models to include. We need to include enough shape variance to be able to

accurately fit to unseen shapes, so we simply fixed the shape variance proportion

a priori at 98%. We also noted that the stepwise regression included most shape

parameters (typically reducing for example from 18 dimensions to 15), so the shape

parameters are not hugely redundant. However the texture model is more

questionable, as the noisy nature of the images means that there is a risk of including a large

number of essentially meaningless “noise” modes. We also noted that the stepwise

regression would typically halve the number of included appearance parameters. We

therefore decided to conduct some preliminary experiments to try and optimise the

amount of variance included in the texture model. We could have run several leave-

one-out experiments over our dataset, and then picked the appearance model form

giving the best performance. But this would be liable to be a biased estimate of

the true population performance, as the same experiment is used to optimise and

test. So instead we performed a coarser level optimisation of the texture model by

performing a bootstrap estimate of performance as follows.

We conducted two levels of randomisation. At the first level we randomly pick 32

images to be excluded, and in the second we randomly order the remaining dataset,

and loop through it using a leave-16-out train/test cycle. We conducted 50 cycles

of the outer exclude-32 loop. We then calculated ROC curves for the concatenated

bootstrap over all such randomisations. Note that this will tend to under-estimate

performance compared to an overall leave-one-out, as at each classification the models

and classifiers are somewhat less well trained (as 48 images are removed rather than

just one). We stepped through the texture variance included from 50% to 95% in 5%

steps, examining both classifiers with all parameters in, and performing a selection

using stepwise regression.
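The two-level randomisation can be sketched as a generator of train/test index splits. This is an illustrative sketch only: the function name, the seed handling and the use of a single shuffle to realise both levels of randomisation are assumptions, and the dataset size is left as a parameter because it is not stated in this section.

```python
import random

def bootstrap_splits(n_images, n_outer=50, n_exclude=32, n_test=16, seed=0):
    """Yield (train, test) index lists for the exclude-32 / leave-16-out scheme."""
    rng = random.Random(seed)
    for _ in range(n_outer):
        order = list(range(n_images))
        rng.shuffle(order)
        # The first n_exclude shuffled images are dropped entirely for this cycle;
        # the remainder is already in random order for the inner loop.
        remaining = order[n_exclude:]
        for start in range(0, len(remaining), n_test):
            test = remaining[start:start + n_test]  # last block may be shorter
            held_out = set(test)
            train = [i for i in remaining if i not in held_out]
            yield train, test
```

Each training set is therefore missing 48 images (32 excluded plus 16 under test), which is the source of the slight under-estimate of performance noted above.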

As each leave-16-out train/test instance produces in fact a slightly different set of

classifiers to all others, the derivation of Receiver-Operating-Characteristic (ROC)

curves is not completely obvious. In order to concatenate the results into single

curves, for each classified vertebra in the test image we recorded the linear classifier’s

measure of the likelihood ratio that it was fractured, given by:

$$L_i(\mathrm{frac}) = \left(1 + \exp(\mathbf{w} \cdot \mathbf{v}_i)\right)^{-1} \tag{7.6}$$

Then in deriving concatenated ROC curves we pool all these likelihood ratios and then

obtain a particular overall performance by imposing a specific detection threshold.

We derived ROC curves by varying the detection threshold used.
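The pooling procedure can be sketched as below. The sketch assumes the scores from all train/test instances have been collected into one list with matching ground-truth labels (1 for fractured, 0 for normal); the function names are illustrative. `fracture_score` follows equation 7.6 as printed, so with that sign convention the score rises as w.v falls.

```python
import math

def fracture_score(w, v):
    """Equation 7.6: score in (0, 1) from the linear discriminant w.v."""
    dot = sum(wi * vi for wi, vi in zip(w, v))
    return 1.0 / (1.0 + math.exp(dot))

def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs as the threshold is lowered.

    Assumes at least one fractured (1) and one normal (0) label.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Admit every vertebra whose pooled score equals the current threshold
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points
```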

Initially we intended to pick the texture model form and variance proportion that

maximised the area under the ROC curve. However we found that ROC curve area

was not a very discriminating measure. All values examined were similarly high

(around 0.98), and varying the variance proportion just tended to move around which

section of the ROC curve was best, rather than producing an overall optimum. There

was a tendency for the models with higher variance to be better at sensitivities in the

85%-95% range, but these often had long final tails and produced poor performance

over a few atypical examples, resulting in lower area over the final portion of the

ROC curve. This would be typical of over-complex models which are undertrained.

Which model form is the best depends really on the portion of the ROC curve around

which one wishes to operate. We knew from provisional results that the 90%-95%

sensitivity region would be a feasible operating region for an appearance classifier,

whereas higher sensitivity (e.g. > 97%) would be likely to produce unacceptably high

false positive rates. We chose to optimise performance around the 95% region.

There is clearly a danger of an optimisation which is based on only one point on a

ROC curve, which will be subject to sampling noise, increasing in the high sensitivity

region. Therefore we took a blurred version of the false positive rate in the [90,97]%

sensitivity region. This was done by convolving the false positive rates in this

region with a beta distribution, with mode 0.95, and parameters selected to effectively

focus the beta PDF in the skewed [90,97]% region. A beta kernel was used rather

than a Gaussian for several reasons. Firstly we needed an asymmetric distribution

which could be skewed, because there is a natural asymmetry to the false positive

rates around 95% - these very rapidly increase as one moves to higher sensitivities.

Secondly the beta distribution has a finite domain on [0,1], and thirdly it is

commonly used as a conjugate Bayesian prior in problems involving binomial (Bernoulli)

distributions. Beta distributions are also commonly used to model events which are

constrained to take place within an interval defined by a minimum, maximum and

a mode; for example in project planning methodologies such as PERT† which use a

mode, optimistic and pessimistic time scale estimate for each task.

The beta distribution has a PDF given by

$$f_B(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} \tag{7.7}$$

where the normalisation constant B (α, β) is the Beta function given by

$$B(\alpha,\beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)} \tag{7.8}$$

and Γ is the standard gamma function.

The parameters α, β can be estimated from the first two moments of a sample by

inverting the equations for mean and variance to obtain, defining $\gamma = \mu(1-\mu)/\sigma^2 - 1$:

$$\alpha = \mu\gamma, \qquad \beta = (1-\mu)\gamma \tag{7.9}$$

where µ, σ are the mean and standard deviation. These can also be estimated from

†Program Evaluation and Review Technique, originally developed for the management of the Polaris missile project

the desired effective range and mode according to:

$$\mu \approx \frac{a + 4b + c}{6}, \qquad \sigma \approx \frac{c - a}{6} \tag{7.10}$$

where a is the minimum, b is the mode, and c is the maximum. Thus (substituting

a = 0.90, b = 0.95, c = 0.97) we obtain µ = 0.945, σ = 0.01167. We then loop over

all points on the ROC curve where the sensitivity changes between 0.90 and 0.97

and convolve their associated false positive value with this beta kernel, and then

normalise. Thus the convolved false positive rate is given by

$$r_{fp} = \frac{\sum_{i \in I} f_B(s_i)\, r_{fp}(s_i)}{\sum_{i \in I} f_B(s_i)} \tag{7.11}$$

where $r_{fp}(s_i)$ is the false positive rate at sensitivity $s_i$, and $I$ is the set of indices

where $s_i \in [0.90, 0.97]$ and $s_i \neq s_{i-1}$.
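Equations 7.9 to 7.11 can be followed through numerically with the short sketch below. The function names are illustrative assumptions, and the ROC curve is assumed to be supplied as (sensitivity, false positive rate) pairs ordered by increasing sensitivity. The beta density is evaluated through log-gamma functions to avoid overflow for the large parameter values that result from such a narrow kernel.

```python
import math

def pert_moments(a, b, c):
    """Equation 7.10: approximate mean and standard deviation from min, mode, max."""
    return (a + 4.0 * b + c) / 6.0, (c - a) / 6.0

def beta_params(mu, sigma):
    """Equation 7.9: method-of-moments estimates of the beta parameters."""
    gamma = mu * (1.0 - mu) / sigma ** 2 - 1.0
    return mu * gamma, (1.0 - mu) * gamma

def beta_pdf(x, alpha, beta):
    """Beta density for 0 < x < 1, computed in log space for numerical stability."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1.0) * math.log(x) + (beta - 1.0) * math.log(1.0 - x) - log_norm)

def convolved_fpr(roc, alpha, beta, lo=0.90, hi=0.97):
    """Equation 7.11: beta-weighted mean FPR over points where the sensitivity changes."""
    num = den = 0.0
    previous = None
    for sensitivity, fpr in roc:
        if lo <= sensitivity <= hi and sensitivity != previous:
            weight = beta_pdf(sensitivity, alpha, beta)
            num += weight * fpr
            den += weight
        previous = sensitivity
    return num / den
```

Calling `pert_moments(0.90, 0.95, 0.97)` reproduces the values µ = 0.945 and σ ≈ 0.01167 quoted above.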

The texture variance proportion to use was then selected for each of the three spinal

regions using the best performing value of rfp. Results are given in the next section.

For now we note that rather different texture variance proportions were used (0.775

for mid-thoracic, 0.675 for lower thoracic, and 0.825 for lumbar). Also gradient models

worked best in the thoracic spine, whereas raw grey level‡ worked best in the lumbar.

Final leave-one-out cross-validation tests were then performed over the entire dataset

with texture models thus determined. Again for each classified vertebra in the test

image we then recorded the linear classifier’s measure of the likelihood ratio that it

is fractured, given by equation 7.6 above.

Similarly for the Eastell-McCloskey method we calculated the area in the tail of

the normal vertebra’s CDF Fn (i.e. the probability that a normal vertebra would

have a height ratio less than or equal to the given one), and then recorded the

“likelihood” of a fracture as the complement of this: 1 − inf{Fn}. Although this is

not strictly a likelihood ratio (it is not well-normalised) the use of the minimum is

logically equivalent to performing an OR operation over all three height ratios in the

original method, and fixing a standard deviation threshold is logically equivalent to a

threshold on the normal population CDF. We have reformulated the method in this

way as it makes it easier to combine data into an overall ROC curve in the same way

as the likelihood ratios from the discriminants.

‡but renormalised using the Geman-McClure function

The statistical significance of classifier differences was investigated using McNemar’s

test [98]. This is applied to single points on the ROC curves [144], and can be used

for pairwise comparisons of either the specificities for given sensitivity, or vice versa.

We compared the specificities of a set of operating points in the 95% sensitivity

region (from 93% to 97%) for the following pairwise comparisons: height ratio vs

shape linear discriminants; height ratio vs appearance linear discriminants; shape

vs appearance linear discriminants; hybrid Eastell-McCloskey vs height ratio linear

discriminant. The test is applicable to matched samples (i.e. same image set) with

binary labels. To compare specificities at given sensitivity, we compared the number

of AB and BA pairings, where an AB result is one where classifier 1 gives the correct

true negative, but classifier 2 gives a false positive; and vice versa for a BA pairing.

Instances where both classifiers give the same result have no influence on the test

statistic. McNemar’s test statistic is given by:

$$\frac{\left(|N_{ab} - N_{ba}| - 1\right)^2}{N_{ab} + N_{ba}} \tag{7.12}$$

where Nab and Nba are the number of AB and BA pairings as defined above. This

would asymptotically follow a chi-squared distribution with one degree of freedom

(valid for (Nab + Nba) ≥ 10), under the null hypothesis that the classifiers are

equivalent apart from random error. We compared a set of points around the desirable

operating region rather than for example ROC curve area, because statistics for the

whole curve area generally involve operating regions of little practical relevance (e.g.

unrealistically low sensitivities or high false positive rates).
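The test statistic of equation 7.12 is simple to compute once the discordant pairs have been counted. The sketch below assumes that each classifier's decisions on the truly normal vertebrae are available as lists of booleans (True meaning the vertebra was flagged as fractured) taken at thresholds giving both classifiers the same sensitivity; the function names are illustrative.

```python
def count_discordant(flags_1, flags_2):
    """Count AB and BA pairings over normal vertebrae.

    AB: classifier 1 gives a true negative, classifier 2 a false positive.
    BA: classifier 1 gives a false positive, classifier 2 a true negative.
    """
    n_ab = sum(1 for f1, f2 in zip(flags_1, flags_2) if not f1 and f2)
    n_ba = sum(1 for f1, f2 in zip(flags_1, flags_2) if f1 and not f2)
    return n_ab, n_ba

def mcnemar_statistic(n_ab, n_ba):
    """Equation 7.12, with the continuity correction of 1."""
    total = n_ab + n_ba
    if total == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    return (abs(n_ab - n_ba) - 1) ** 2 / total
```

As a worked example, 15 AB pairings against 5 BA pairings give (10 - 1)²/20 = 4.05, which exceeds the 5% critical value of 3.84 for a chi-squared distribution with one degree of freedom.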

We also evaluated the per-patient (rather than per-vertebra) sensitivity and false

positive rates for a range of underlying per-vertebra specificities. Here we assume

that a patient is truly osteoporotic if any vertebral fractures at all are present, and

normal otherwise. Then, for the underlying per-vertebra operating specificity, we

check whether the patient is diagnosed as having any fractures at all by the combined

classifiers applied to all the vertebrae. This enables us to produce a set of patient-

level (really image-level) false positive and sensitivity figures. Osteoporotic patients

will often have several vertebral fractures, and so it need not always matter if say

a grade 1 fracture is missed in a patient with other fractures that are recognised.

On the other hand per-patient false positive rates will be higher than the underlying

single-vertebra false positive rates. We examined these patient-level statistics for

single vertebra false positive rates of 2.5% and 5%.
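The patient-level aggregation can be sketched as follows, assuming each image is represented as a list of (score, is_fractured) pairs, one per vertebra, and that `threshold` is the score threshold corresponding to the chosen per-vertebra operating point. These names and the data layout are assumptions made for illustration.

```python
def patient_level_rates(patients, threshold):
    """Return (sensitivity, false positive rate) at the patient (image) level.

    Assumes the data contain at least one fractured and one normal patient.
    """
    tp = fn = fp = tn = 0
    for vertebrae in patients:
        # A patient is truly positive if any vertebra is fractured
        truly_fractured = any(is_fractured for _, is_fractured in vertebrae)
        # A patient is flagged if any vertebra at all is classified as fractured,
        # whether or not it is the vertebra that is actually fractured
        flagged = any(score >= threshold for score, _ in vertebrae)
        if truly_fractured:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return tp / (tp + fn), fp / (fp + tn)
```

Because a normal patient is flagged by a false positive on any one of several vertebrae, the patient-level false positive rate returned here will generally exceed the per-vertebra rate, as noted above.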

7.4 Results

Table 7.1 shows the beta-convolved false positive rate in the 95% sensitivity region

as a function of the variance proportion retained in the texture model, for a

gradient based texture model; and Table 7.2 shows the corresponding figures using a

renormalised grey-level intensity model. These results are for the initial bootstrap

experiments, and were used to select the texture model to be used. It can be seen

that gradient models give better performance in the thoracic spine, whilst grey-level

models seem better in the lumbar. The reasons for the difference are not obvious, but

may be to do with the tendency for the lumbar region to have higher unstructured

noise due to a significant number of obese patients. As DXA images use low energy,

fat in the lumbar region can absorb a substantial portion of this photon flux,

leading to poor signal to noise ratio with obese patients. As the variance of a gradient

is larger than the underlying grey level (typically doubled), it could be that under

higher noise conditions some advantages of gradient-based normalisation are lost due

to the increase in variance. Referring to Table 7.1, there is generally a region of better

performance, and we selected variance proportions towards the upper end rather than

the lower. This is because we also experimented with a stepwise regression process as

discussed earlier, so we retain a further opportunity to discard useless information;

whereas there is no way to add it once it is excluded from the initial model. Also

as the second set of experiments should be somewhat better trained it is better to

err on the side of including some additional feature information that could yet be

useful. We selected variance proportions in the middle of the 75-80% region for the

mid-thoracic spine, and the 65-70% region for the lower-thoracic. It is interesting

that the optimal proportion for the lower-thoracic region seems smaller than for the

mid-thoracic. Somewhat different sampling lengths are used (smaller from T7-T9)

so this may be part of the explanation. Artefacts from diaphragm motion also tend

to affect T10-T11§. Referring now to the grey-level model results in Table 7.2, we

selected the mid-point of the 80-85% region for the lumbar texture model. The grey

level texture models tend to have significantly fewer modes than gradient based ones,

so it makes sense for the retained variance to be somewhat higher for the lumbar

than the thoracic spine.

Figures 7.1, 7.2 and 7.3 show the ROC curves for the mid-thoracic, lower thoracic

and lumbar spine respectively, derived from the final set of leave-one-out experiments.

§due to the long DXA scan time of around 5 minutes

Each figure gives ROC curves for the shape parameter classifier, appearance parameter

classifier, and baseline Eastell-McCloskey hybrid classifier. The figures

display the most interesting section of the ROC curves (sensitivity exceeding 0.6,

FPR below 0.5). We show the effect of fracture grade in Figures 7.4 and 7.5,

which present similar ROC curves for grade 1 and 2 fractures respectively. No ROC

curve is given for severe fractures, as all classifiers are virtually perfect against the

grade 3 fractures. The areas under the curve for the various ROC curves are given

in Table 7.3, whilst Tables 7.4, 7.5 and 7.6 show the false positive rates obtained

with the various classifiers at sensitivities of 0.90, 0.95 and 0.98 for the mid-thoracic,

lower-thoracic, and lumbar vertebrae respectively.

The McNemar test statistics for the specificity comparisons around the 95% sensitivity

operating point are given in Table 7.7. It can be seen that for most comparisons

between a shape model classifier and a height ratio linear discriminant, there are

highly significant differences; and even more so for the comparison between an

appearance model classifier and the height ratio linear discriminant. The exception is

in the lumbar spine, where use of a shape model discriminant is not significantly

better than a height ratio discriminant. The appearance model classifier significantly

outperforms the shape model classifier in every case but one. In the comparison

between the two height based methods (linear discriminant vs Eastell-McCloskey) there

is no consistent overall difference; though one may outperform the other at particular

operating points.

A visual representation of the optimal shape discriminant direction for mid-thoracic

vertebrae is shown in Figure 7.6, which displays how the shape varies in the

discriminant direction, starting from a normal vertebra, then moving to the mean shape

and adjusting the shape parameters through the classification boundary and beyond.

It appears from Figure 7.6 that biconcave endplate fractures dominate, although a

modest degree of wedging does also start to appear. Note that Figure 7.6 shows

only the variation in scale-free shape, and absolute height (or McCloskey height

ratio) is not indicated on the Figure. Similarly Figure 7.7 shows the variation in

the synthesised appearance moving along the discriminant direction from the mean

appearance.

The results presented are all for classifiers using the full set of model parameters,

given the proportion of texture variance retained. As discussed above we also

experimented with using stepwise regression to try and obtain a more parsimonious set of

| Texture Variance (%) | Mid-Thoracic FPR (%) | Lower-Thoracic FPR (%) | Lumbar FPR (%) |
|---|---|---|---|
| 50 | 7.1 | 10.3 | 9.0 |
| 55 | 7.9 | 10.6 | 9.9 |
| 60 | 7.0 | 9.9 | 9.9 |
| 65 | 5.9 | 7.2 | 10.3 |
| 70 | 5.1 | 7.6 | 9.5 |
| 75 | 4.6 | 9.4 | 8.6 |
| 80 | 4.8 | 9.8 | 9.1 |
| 85 | 5.0 | 10.1 | 9.4 |
| 90 | 7.7 | 10.1 | 9.6 |

Table 7.1: Beta-convolved false positive rates (%) for the gradient appearance classifier as a function of variance retained in texture model

| Texture Variance (%) | Mid-Thoracic FPR (%) | Lower-Thoracic FPR (%) | Lumbar FPR (%) |
|---|---|---|---|
| 50 | 9.1 | 10.3 | 8.2 |
| 55 | 9.4 | 10.1 | 8.4 |
| 60 | 9.3 | 10.1 | 8.8 |
| 65 | 7.2 | 9.9 | 7.9 |
| 70 | 8.4 | 10.3 | 7.7 |
| 75 | 7.9 | 11.0 | 6.5 |
| 80 | 8.1 | 10.0 | 5.9 |
| 85 | 8.0 | 11.1 | 6.2 |
| 90 | 8.6 | 10.6 | 7.5 |
| 95 | 9.5 | 11.5 | 7.9 |

Table 7.2: Beta-convolved false positive rates (%) for the intensity appearance classifier as a function of variance retained in texture model

parameters. Although in provisional results using larger proportions of texture

variance (92%) this resulted in some improvements in certain cases, once the retained

texture variance had been optimised, we found that the stepwise selection process

did not result in any further improvement in appearance model results, and indeed

could even degrade performance. Further results are therefore omitted.

In [118] we had also presented some results using a modified classifier training

method, which used a form of robust statistics to fit the classifier weights. In essence

this downweighted the influence of cases far away from the classification boundary

(e.g. severe fractures). This had appeared to produce some slight improvements, but

again after further optimisation of the texture models, we found that this more com-

| Classifier | MT | LT | Lum | G1 | G2 | G3 | All |
|---|---|---|---|---|---|---|---|
| Eastell-McCloskey | 0.9631 | 0.9674 | 0.9733 | 0.9248 | 0.9800 | 0.9947 | 0.9690 |
| Height Ratios LD | 0.9691 | 0.9718 | 0.9749 | 0.9228 | 0.9846 | 0.9994 | 0.9718 |
| Shape LD | 0.9782 | 0.9854 | 0.9804 | 0.9435 | 0.9924 | 0.9998 | 0.9810 |
| Appearance LD | 0.9884 | 0.9848 | 0.9798 | 0.9490 | 0.9955 | 0.9998 | 0.9838 |

Table 7.3: Area under ROC curves. Columns labelled MT, LT and Lum (spinal region) refer to mid-thoracic (T7-T9), lower-thoracic (T10-T12) and lumbar (L1-L4) vertebrae respectively; those labelled G1, G2, G3 refer to fracture grades 1, 2 and 3 respectively.

Figure 7.1: Mid-Thoracic Spine (T7-T9) ROC Curves showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants

| Classifier | FPR (%) at 90% sensitivity | FPR (%) at 95% sensitivity | FPR (%) at 98% sensitivity |
|---|---|---|---|
| Eastell-McCloskey | 6.7 | 20.6 | 27.6 |
| Height Ratios LD | 2.8 | 19.9 | 26.9 |
| Shape LD | 3.4 | 8.2 | 20.9 |
| Appearance LD | 2.4 | 3.2 | 7.6 |

Table 7.4: False Positive Rates (%) in the mid-thoracic spine (T9-T7) for the different classifiers at various sensitivities.

Figure 7.2: Lower-Thoracic Spine (T10-T12) ROC Curves showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants

Figure 7.3: Lumbar Spine ROC Curves showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants

Table 7.5: False Positive Rates (%) in the lower-thoracic spine (T12-T10) for the different classifiers at various sensitivities

Table 7.6: False Positive Rates (%) in the lumbar spine for the different classifiers at various sensitivities

Figure 7.4: ROC Curves for combined Grade 1 Fractures showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants

Figure 7.5: ROC Curves for combined Grade 2 Fractures showing the Eastell-McCloskey height classifier and the shape and appearance model linear discriminants

McNemar test statistic at each sensitivity level:

| Spinal region | Classifier Comparison | 93% | 95% | 97% |
|---|---|---|---|---|
| T7-T9 | Height LD vs Shape LD | 37.4 | 83.8 | 19.5 |
| T7-T9 | Height LD vs Appearance LD | 54.9 | 127.7 | 138.7 |
| T7-T9 | Shape LD vs Appearance LD | 10.8 | 33.0 | 93.6 |
| T7-T9 | E-M Height vs Height LD | (1.6) | 0.3 | 0.2 |
| T10-T12 | Height LD vs Shape LD | 91.0 | 93.0 | 14.7 |
| T10-T12 | Height LD vs Appearance LD | 99.0 | 96.6 | 40.7 |
| T10-T12 | Shape LD vs Appearance LD | 3.2 | 6.5 | 15.8 |
| T10-T12 | E-M Height vs Height LD | (55.2) | 1.3 | 15.0 |
| L1-L4 | Height LD vs Shape LD | (8.2) | 2.6 | 3.7 |
| L1-L4 | Height LD vs Appearance LD | 26.7 | 42.0 | 14.0 |
| L1-L4 | Shape LD vs Appearance LD | 65.2 | 37.6 | 7.4 |
| L1-L4 | E-M Height vs Height LD | 5.6 | (0.9) | (8.0) |

Table 7.7: McNemar Test Statistic comparing various classifiers between 93% and 97% sensitivity. E-M Height refers to Eastell-McCloskey, whereas Height LD is the linear discriminant using the 3 heights. Note that the 5% significance level of the $\chi^2_1$ distribution is 3.84, and large values above this indicate significant differences. Bracketed figures indicate deterioration from 1st to 2nd classifier.

Figure 7.6: Visualisation of the (scale-free) discriminant direction in shape parameter space for a mid-thoracic vertebra. Shapes are generated by adjusting the shape parameters from the mean shape along the optimum discriminant direction in units of half the distance from the mean to the classification boundary. Counting left to right: 1) the shape moved one step into the normal region; 2) the mean shape; 3) a slightly deformed vertebra halfway to the boundary; 4) the boundary shape; 5-7) fractured vertebrae lying 1, 2, 3 steps into the fractured region. Note these are scale-free shapes with no account of absolute height.

plicated method produced no consistent improvement. Finally we also experimented

with fitting the appearance model using a robust M-estimator using the Geman-

McClure kernel as in [13]. This produced no improvement, though that might be

due to the fact that the radiologists discarded certain cases as “not visualised”. In

practice it could still be worthwhile fitting the models using a robust estimator, to

avoid for example false positives due to diaphragm motion artefacts. But in this

thesis we have presented results only for standard least squares methods, which appear

to work well with this data.

The false positive rates and associated sensitivities at the patient level (i.e. combining

all vertebrae) are summarised in Table 7.8. Note that these figures reflect the fracture

prevalence within our dataset, which was deliberately fracture-enriched. Patient-level

results cannot be generalised to a more general population, but are included as a rough

guide to how vertebral-level specificities and sensitivities may translate to the overall

patient level.

Figure 7.7: Visualisation of the (scale-free) discriminant direction in appearance parameter space for a mid-thoracic vertebra. Appearances are generated by adjusting the shape parameters from the mean shape along the optimum discriminant direction in units of 0.25 the distance from the boundary to a borderline grade 3 fracture. The top left appearance is that of a normal vertebra two units from the boundary, then on the right is a normal vertebra one unit from the boundary. The middle appearance represents the boundary between normal and fractured vertebra. The lower two appearances represent moderate (left) and severe fractured vertebrae (right) placed at two and four units from the boundary. Note that the appearance model displays the renormalised smoothed image gradient along the sampling profiles. Also note these are scale-free shapes with no account of absolute height.

| Classifier Type | Vertebral FPR (%) | Patient FPR (%) | Patient Sensitivity (%) |
|---|---|---|---|
| Eastell-McCloskey | 2.5 | 15.5 | 92.6 |
| Shape LD | 2.5 | 12.4 | 96.3 |
| Appearance LD | 2.5 | 10.8 | 97.2 |
| Eastell-McCloskey | 5.0 | 21.6 | 95.4 |
| Shape LD | 5.0 | 18.6 | 97.2 |
| Appearance LD | 5.0 | 18.6 | 98.1 |

Table 7.8: Overall Patient-Level FPR and Sensitivity given individual vertebrae FPR

7.5 Discussion

Firstly it is interesting that even the height-ratio based methods performed better

than might have been expected. For example Li et al [88] reported a 60% sensitivity

and FPR of 0.85%, when applying a 3SD threshold to height ratios; or when using

a 2SD threshold in the same study, a false positive rate of 8% pertained, with 80%

sensitivity. At similar specificities we obtained a 68% sensitivity (FPR of 0.85%),

and 92% sensitivity with FPR of 8%; whereas our false positive rate at 80%

sensitivity was only 2.2%. However most studies on morphometric methods have used

radiographs, where apparent shape change due to projective tilting may be more

liable to induce false positives than in DXA. We have only considered vertebrae up to

T7, and some vertebrae have been excluded because the radiologists defined them as

inadequately visualised. Also we use a relatively complex height calculation which

normalises for lateral displacement of points, and the set of image processing

facilities and model-based interpolation inherent in our manual markup probably lead

to better precision in the underlying point placement. But it is interesting that our

DXA results appear to be considerably better than earlier studies using radiographs.

This provides some evidence that DXA images may actually be better suited than

radiographs for morphometric methods, though one has to accept a small proportion

of unvisualised vertebrae, and the technique would be poorer for obese patients. We

must also immediately qualify this by saying it is also necessary to use reasonably

sophisticated digital image processing techniques¶ to bring out the full information

in the image, in order to precisely place the points.

The appearance-based classifier dominates the ROC curves in each of the 3 spinal

regions over the ROC regions of practical interest (e.g. sensitivity exceeding 0.7,

¶i.e. non-linear contrast enhancements tuned to the local histogram

false positive rate below 0.2). At the 95% sensitivity point the appearance based

classifier is superior in every case. In the lumbar and mid-thoracic spines the false

positive rate is more than halved in comparison to the best other classifier, and in

the lower-thoracic spine the false positive rate drops to 4.7% compared to 6.4% with

a shape based classifier (significant on the McNemar test), or 18% with the best

height-based method. The shape classifiers outperform height based classifiers in the

thoracic spine, especially in the lower-thoracic spine, where a shape-based classifier is

nearly as good as an appearance classifier. However in the lumbar spine, the use

of a shape-based classifier does not outperform height methods in the 95% region,

although it does appear from the ROC curve (Figure 7.3) that it is more specific at

slightly lower sensitivity (in the 80-90% region).

In the thoracic spine, the use of appearance model classifiers allows a huge

improvement in specificity at the 95% sensitivity point. The false positive rate drops from

around 20% with current morphometric methods to about 4%, whilst in the lumbar

the false positive rate is halved. The significance of these results is confirmed by the

results of the McNemar tests.

At a sensitivity of 95% we obtain an overall false positive rate (FPR) of 4.9% using

appearance classifiers (roughly 5% equal error rate), whereas this increases to 12.4%

using only the shape parameters, and 18.3% using Eastell-McCloskey (or 16.2% with

a height ratio discriminant). The improvement in specificity is particularly marked

for the grade 1 fractures, as can be seen on Figure 7.4. The overall figure of 95%

sensitivity for the appearance classifier translates to 85.4% sensitivity against grade

1 fractures, 98.5% for grade 2 and 100% for grade 3 fractures (assuming the same

false positive rate of 4.9%). This compares to 75.2% against grade 1 fractures for a

shape classifier or 64.0% for a height ratio discriminant (65.2% Eastell-McCloskey),

at the same specificity. This indicates that the underlying appearance model does

capture some more subtle discriminating features than the simple height ratio or

shape classifiers; for example, information about the crispness of the edge of the

vertebral body. This may allow for example some of the false positive short vertebral

height wedge deformities, typical of the mid-thoracic region, to be rejected by the

appearance classifier. Once the fracture has reached grade 2, there is less to choose

between the three classifier types over much of the sensitivity range. At the same 4.9%

false positive rate, the shape and Eastell-McCloskey classifiers produce sensitivities of

98.5% (identical to appearance classifier) and 92.5% respectively. On severe fractures,

all linear discriminants are so good by this stage that it is not possible to tell any

difference.
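
The figures quoted above are sensitivities read off at a fixed false positive rate. A minimal sketch of that calculation is given here, assuming that each classifier yields a scalar score with larger values indicating fracture; the function names and the toy scores are hypothetical.

```python
import math

def threshold_for_fpr(normal_scores, target_fpr):
    """Threshold such that at most target_fpr of normals score strictly above it."""
    ranked = sorted(normal_scores, reverse=True)
    # Number of false positives we are allowed (epsilon guards float rounding).
    n_allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if n_allowed >= len(ranked):
        return float("-inf")
    return ranked[n_allowed]

def sensitivity_at_fpr(normal_scores, fractured_scores, target_fpr):
    """Fraction of fractured cases detected at the threshold fixed by the normals."""
    t = threshold_for_fpr(normal_scores, target_fpr)
    detected = sum(1 for s in fractured_scores if s > t)
    return detected / len(fractured_scores)

# Toy scores, for illustration only.
normals = list(range(20))            # 20 normal vertebrae with scores 0..19
fractured = [17, 18.5, 25, 30]       # 4 fractured vertebrae
print(sensitivity_at_fpr(normals, fractured, 0.05))
```

Applying the same threshold separately to the grade 1, 2 and 3 subsets gives per-grade sensitivities at a common false positive rate.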

Smyth et al [126] reported on results for classifying vertebrae using shape parameters,

but the dataset was small and confined to lumbar vertebrae. Our results extend

the method by using the appearance parameters. Furthermore in [126] a quadratic

classifier was used, which is theoretically optimal for Gaussian distributions with

unequal covariance matrices. We found that quadratic classifiers performed slightly

worse than the simpler linear classifier, even when stepwise regression was used to

reduce the number of features. This is probably because the number of fractured

training examples was too small to reliably estimate the population covariance ma-

trix of fractured parameters. In [126] a larger relative improvement was reported

between the shape parameter and height ratio classifier. However [126] used a more

complex shape model, including both the endplate inner contour, and the outer corti-

cal rim; whereas our shapes model only the endplate outer edge. We would expect to

capture at least a similar amount of information in the appearance model, where for

example secondary edges cause gradient highlights, which should produce particular

appearance mode parameters. De Bruijne et al [17] propose a Neighbour-Conditional

Shape Model. This predicts the expected (normal) shape given several neighbouring

vertebrae, and uses the total deviation from the predicted shape to classify a verte-

bra. The method was evaluated on lumbar radiographs. This does employ additional

prior information about the interrelations between vertebrae, but by using only shape

the method may be more prone than appearance-based classifiers to false positives

on non-fracture deformities. Our appearance classifier specificity at 95% sensitivity

seems good compared to that reported in [17] (95% compared to 84% in [17]), but

of course the datasets are different, and [17] used radiographs, which have the added

difficulty of projective effects.

Li et al [88] derived sensitivity/specificity figures for three radiologists using the semi-

quantitative method, compared to a gold standard derived by a consensus reading

involving also a fourth radiologist expert in the SQ method. The median sensitivity

was 88% with a specificity of 98% (so FPR of 2%). Our overall sensitivity with the

appearance classifier at this FPR is 88.3%, strikingly similar to that of an experienced

radiologist using SQ on radiographs. Of course our gold standard is less rigorous than

that of [88], and the radiologists’ reporting in our study might have been different in

some cases on radiographs rather than with the poorer quality DXA images. Both our

DXA gold-standard consensus reading and the classifier could have been wrong about

particular vertebrae, if compared to an even more rigorous gold standard derived

from a consensus reading of radiographs. Nevertheless it is interesting that, within

the limitations of DXA image quality, the concordance of our appearance classifier

with expert reading can be configured to be comparable to that of inter-radiologist

concordance using SQ on radiographs.

Examining table 7.8 we see that at the coarser level of an overall patient result,

there is less difference between the different classifiers than would be expected from

single-vertebra performance. The overall patient FPR ‖ for a 5% single vertebra FPR

is around 19% for shape and appearance classifiers, which is about half what would

be expected if all vertebrae were independent. When OR-ing all individual vertebrae,

there is little difference between shape and appearance classifiers, though these still

outperform morphometric methods. On the whole, differences in classifier perfor-

mance appear swamped by the statistical correlations between fracture occurrence in

different vertebrae in the same individual. This may be partly a result of our dataset

being deliberately fracture-enriched, including an over-representation of severely os-

teoporotic patients. In a more truly representative population, we might see more

of the individual vertebrae performance differences preserved. Also these diagnosis

results at the patient level reflect the system performance viewed as a stand-alone

automatic system. In reality we envisage our methods being used as an aid to a clin-

ician, and in this context the appearance classifier would result in fewer overall false

alarms at the vertebral level, which might lessen clinician workload in re-examining

each false positive vertebra on the patient’s image.
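
The comparison with independent vertebrae made above can be checked with a short calculation. This is a sketch only, assuming 10 assessed vertebrae per patient (T7 to L4); the function names are hypothetical.

```python
def patient_fpr_if_independent(vertebra_fpr, n_vertebrae):
    """Chance that at least one of n independent normal vertebrae is flagged."""
    return 1.0 - (1.0 - vertebra_fpr) ** n_vertebrae

def patient_is_positive(vertebra_flags):
    """OR rule: a patient is positive if any vertebra is diagnosed as fractured."""
    return any(vertebra_flags)

# Assumed: 10 vertebrae per patient, 5% false positive rate per vertebra.
expected = patient_fpr_if_independent(0.05, 10)
print(f"{expected:.1%}")
```

Under these assumptions the independent-vertebrae figure is roughly 40%, so an observed patient-level rate of around 19% is indeed about half of it.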

7.6 Classifying given a semi-automatic segmenta-

tion

7.6.1 Semi-automatic method

The previous results are based on a detailed manual segmentation of all the vertebrae,

and so represent an idealised classifier performance. We also investigated classifier

performance using the semi-automatic segmentation obtained from our AAM meth-

ods as described in the previous chapter. We stored the segmented shapes obtained

by using quintet sub-models with the fractured alternative initialisation method (see

‖i.e. where a positive patient result is one where any of the vertebrae is diagnosed as fractured

section 6.5.6). In fact for each image we saved 10 replications, using different ran-

dom errors in the initialisation (i.e. randomising over the vertebral centre locations

to simulate the precision of a clinician clicking on the vertebral centres). Thus in

these experiments the test-set has each image present 10 times, but with slightly dif-

ferent segmentations. The vertebral shape and appearance models, and all classifiers

were then trained exactly as above (i.e. using the manual segmentations for classifier

training), and a similar leave-one-out experiment was run, but this time each test

case was evaluated using each of its 10 automatic segmentations. Sensitivity and

corresponding false positive rates were then evaluated exactly as before.

7.6.2 Semi-automatic Results

Figure 7.8 shows the ROC curves obtained from the semi-automatic segmentations,

with all three spinal regions combined. ROC curves for the shape and appearance

classifiers are given in this Figure, with a baseline ROC curve (with legend “Heights”)

for 3 height morphometry (“Eastell-McCloskey hybrid”). It can be seen that overall,

below a sensitivity of 75%, the shape and appearance classifiers are almost

indistinguishable, but as the FPR increases the appearance classifier dominates. Both shape

and appearance classifiers appear better than height-based morphometry. The over-

all ROC curve does mask certain performance differences in the different regions of

the spine. Table 7.9 shows the classifier sensitivities at FPR of 1%, 2.5% and 5%

for the 3 spinal regions, and 3 fracture grades. The advantages of the appearance

classifier seem greatest in the mid-thoracic spine, which accords with the results with

the “gold-standard” manual segmentation, and which interestingly is where morpho-

metric methods are particularly prone to false positives on mild wedges due to other

diseases (e.g. degenerative disc disease). The appearance classifier gives the best sen-

sitivity at 5% FPR, but at lower FPR (1%) it is outperformed at T10 and below by a

pure-shape classifier, and in the lower thoracic spine even by standard morphometry.

The advantages of the appearance classifier seem to manifest more as the sensitivity

is increased. Overall the sensitivity of the appearance classifier at 5% FPR is reduced

from around 95% with ideal (manual) segmentation to 86% with the semi-automatic

segmentation, a loss in sensitivity of 9% due to segmentation errors. The reduction

remains approximately constant across all fracture grades - even severe fractures only

give 92% sensitivity, which is not surprising given that 8% of the (grade 3) edges had

been mis-fitted to a neighbouring vertebra (see table 6.19).

Figure 7.8: ROC curves for (semi)automatically-segmented images, with all vertebrae combined

Figure 7.9 shows ROC curves for the appearance classifier only, but separated by the

3 fracture grades. The areas under the various ROC curves for classifier diagnoses

given the semi-automatic segmentation are given in table 7.10. The appearance

classifier gives the best ROC curve area in every category.
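
For reference, the area under an ROC curve can be computed directly from the classifier scores as the probability that a fractured case scores higher than a normal one. A minimal sketch under that rank-based definition follows; the function name and the toy scores are hypothetical, and this is not necessarily how the areas in table 7.10 were computed.

```python
def roc_auc(normal_scores, fractured_scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    wins = 0.0
    for f in fractured_scores:
        for n in normal_scores:
            if f > n:
                wins += 1.0
            elif f == n:
                wins += 0.5
    return wins / (len(fractured_scores) * len(normal_scores))

# Toy scores, for illustration only.
print(roc_auc([0.1, 0.4, 0.35], [0.8, 0.3]))
```

The pairwise form is quadratic in the number of cases, which is acceptable for a sketch; a rank-sum implementation gives the same value more efficiently.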

The patient-level sensitivity/FPR are summarised in table 7.11. At an individual

vertebra FPR of 2.5%, a patient level FPR of under 10% was obtained for an appearance

classifier, with 91% sensitivity. To obtain 95% sensitivity at patient level, it would be

necessary to increase the FPR to around 22%. Although these patient statistics are

relative to our (fracture-enriched) dataset, they indicate a good enough performance

to encourage practical use of the technique as an aid to a clinician, or possibly even

in triage. We also anticipate that with a modest degree of user-correction of a few

poorly segmented vertebrae, the system performance could be improved to approach

the limits established for the classification on a fully manual segmentation.

7.7 Conclusions

We have developed linear discriminants for detecting vertebral fracture using shape

and appearance parameters. The main advantages of using the more complex ap-

Spinal Region     Classifier       Sensitivity (%) given FPR
or Grade          Type             1%      2.5%    5%
Mid-Thoracic      E-M Height       41.0    58.9    73.1
                  Shape LD         54.8    66.0    76.2
                  Appearance LD    61.5    71.4    83.3
Lower-Thoracic    E-M Height       69.4    75.6    82.7
                  Shape LD         66.7    82.6    88.5
                  Appearance LD    60.0    79.1    89.4
Lumbar            E-M Height       59.3    68.1    79.2
                  Shape LD         74.6    78.6    82.4
                  Appearance LD    72.1    82.1    87.0
Grade 1           E-M Height       34.1    45.5    59.9
                  Shape LD         44.2    62.0    69.5
                  Appearance LD    42.0    61.0    75.2
All               E-M Height       57.6    66.4    74.8
                  Shape LD         66.0    77.0    82.6
                  Appearance LD    66.9    77.8    86.0

Table 7.9: Classifier Sensitivities for 1%, 2.5% and 5% FPR, for semi-automatic segmentation. E-M Height means the Eastell-McCloskey morphometric method.

                     Spinal Region                Fracture Grade
Classifier           MT        LT        Lum      G1        G2        G3        All
Eastell-McCloskey    0.8878    0.9456    0.9131   0.8623    0.9238    0.9586    0.9164
Height Ratios LD     0.8899    0.9462    0.9119   0.8672    0.9255    0.9507    0.9168
Shape LD             0.9167    0.9617    0.9347   0.8889    0.9502    0.9672    0.9376
Appearance LD        0.9366    0.9630    0.9484   0.9085    0.9615    0.9708    0.9490

Table 7.10: Area under ROC curves given semi-automatic segmentation. Columns labelled MT, LT and Lum refer to mid-thoracic (T7-T9), lower-thoracic (T10-T12) and lumbar (L1-L4) vertebrae respectively; those labelled G1, G2, G3 refer to fracture grades 1, 2 and 3 respectively.

Figure 7.9: ROC curves for appearance classifier on (semi)automatically-segmented images, for the 3 fracture grades

Classifier           Vertebral    Patient     Patient
Type                 FPR (%)      FPR (%)     Sensitivity (%)
Eastell-McCloskey    2.5          13.1        89.4
Shape LD             2.5          10.0        89.8
Appearance LD        2.5          9.8         90.8
Eastell-McCloskey    5.0          25.6        92.7
Shape LD             5.0          19.8        94.7
Appearance LD        5.0          21.9        95.0

Table 7.11: Overall Patient-Level FPR and Sensitivity given individual vertebrae FPR

pearance or shape parameter discriminants rather than height ratio methods are in

detecting grade 1 fractures. For lumbar vertebrae, the appearance-based classifier

can approximately halve the false positive rate when operating at 95% sensitivity

compared with traditional quantitative morphometric methods; and for thoracic ver-

tebrae the reduction is approximately four-fold.

There is generally an advantage in using a fuller shape description, though this is not

always apparent with lumbar vertebrae, but the better performance of the appear-

ance classifiers indicates that the underlying appearance model also captures textural

indicators of fracture, such as the more complex edge structure associated with end-

plate collapse. However the more complex appearance classifier also displays some

sign of undertraining, as it does not perform well at very high sensitivity (98%). Nev-

ertheless this sensitivity level is essentially unrealistic, as it produces unacceptable

false positive rates for all classifiers.

The results obtained when classifying on the basis of a semi-automatic segmentation

are not as good, but still very promising. At a false positive rate (per vertebra)

of 5% the overall sensitivity is 86% for the appearance classifier, compared to 75%

for standard morphometric methods. We would anticipate a modest degree of user-

correction of the segmentation, which should allow the results to approach those of

the manually segmented solutions.

Further improvements in appearance-parameter based classification might be made

by using more sophisticated training methods and non-linear kernel methods: for

example a Support Vector Machine [136] with radial basis function kernels. Neither

have we investigated what improvement might be made by using information from

neighbouring vertebrae. We have used the appearance parameters as a compact way

of capturing both shape and texture information, but one disadvantage of using a

model is that in a few cases which are not well fitted by the appearance model, the

parameters may not represent the actual texture of the vertebra very well. It might

also be possible to develop purely data-driven texture descriptors that can be used

in similar classifiers. There is thus further scope for developing reliable quantitative

methods of classification. So although current morphometric methods are not widely

trusted to be reliable, there is still scope for reliable quantitative appearance-based

methods. Indeed the 5% equal error rate we have already established is in practice

probably not much worse than typical inter-radiologist concordance, especially when

one considers the subjectivity involved in the widely accepted Genant SQ method.

Our development of more sophisticated quantitative methods also creates the possi-

bility of replacing the current grading system (grades 1-3) with a more continuous

measure based on the perpendicular distance from the classification boundary. This

might make it quicker to spot subtle worsening of existing fractures in longitudinal

studies (e.g. in a one-year follow up). In principle this could improve the statisti-

cal power of clinical trials, resulting ultimately in cost-savings as fewer subjects (or

shorter studies) would then be needed. Similar points are made by de Bruijne et al

in [17]. There would also be the future potential to combine a quantitative DXA

vertebral fracture measure with a BMD assessment using the same equipment. For

example a patient who might fall in the osteopenia class on BMD alone, but who also

had say two grade 1 vertebral fractures, might be considered as in fact osteoporotic.
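
A continuous measure of this kind could be obtained from a linear discriminant as the signed perpendicular distance of the parameter vector from the decision hyperplane. The sketch below assumes a boundary of the form w.x + b = 0; the function name, weights and feature values are hypothetical and are not taken from the trained classifiers.

```python
import math

def signed_boundary_distance(weights, bias, features):
    """Signed perpendicular distance of a feature vector from the plane w.x + b = 0.

    Positive values lie on the side the weight vector points to (taken here
    to mean the fractured side), negative values on the opposite side.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("weight vector must be non-zero")
    projection = sum(w * x for w, x in zip(weights, features)) + bias
    return projection / norm

# Hypothetical two-parameter discriminant, for illustration only.
print(signed_boundary_distance([3.0, 4.0], -5.0, [3.0, 4.0]))
print(signed_boundary_distance([3.0, 4.0], -5.0, [0.0, 0.0]))
```

Tracking this distance for the same vertebra across visits would give a graded quantity rather than a three-level label.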

Chapter 8

Segmentation of vertebrae in

radiographs

8.1 Introduction

In previous chapters we have presented results for segmenting and classifying ver-

tebrae on DXA images. In this chapter we present results on the segmentation of

radiographs. The work of this chapter has been previously published in [115].

Radiographs are somewhat more challenging as the fan beam used in conventional

radiography can lead to parallax errors and apparent scale changes, and there is

considerable variation of contrast across the image. In this chapter we apply our

segmentation methods to a dataset of lumbar radiographs, as the first step in obtain-

ing more reliable quantitative classification of vertebral fracture must be to achieve a

reliable automatic segmentation. Some success in automatically locating vertebrae in

radiographs has been reported by Howe et al [77] and by de Bruijne et al [18]. Howe

used an AAM approach not dissimilar to ours. First a Generalized Hough Transform

was used to locate a plausible starting position. Then a global AAM was fitted, and

finally the appearance model parameters for individual vertebrae were reset to half-

way between the global model solution and the mean; and then individual vertebrae

AAMs were run to give the final solution. However we have not found individual

vertebrae models to be reliable, and we have also found that the use of the global

model first tends to locate local minima when fractures are present. So we have used

the sub-model approach initially trialled on DXA images.

De Bruijne et al [18] used a shape model to generate a large ensemble of candidate

solutions, and then evolved the solution set using shape particle filtering in

conjunction with a bank of nearest neighbour classifiers based on a feature set of Gaussian

derivative filters up to 3rd order. Each pixel in a candidate shape is assigned a

probability of being background, within vertebra, or on the vertebral boundary, and

then an overall likelihood measure is derived for the whole shape and used to evolve

the particle ensemble. The need to evolve a large candidate set means the method is

computationally rather expensive, but it is a fully automatic method, and the shape

mutation methods used mean that the final solutions are not over-constrained by

shapes in the training set.

In this chapter we assess the accuracy of the triplet AAM sub-model approach to

segmenting vertebrae on lumbar radiographs. Although radiographs typically have

better resolution and signal to noise ratio, the shape and appearance of the vertebrae

are more complex due to projectional parallax effects. The divergent beam used in

conventional radiography causes a variable scaling across the image, and can cause

severe apparent tilting of the vertebral bodies. Also as the more extreme vertebral

bodies tend to be obliquely irradiated, their superior and inferior endplates typically

appear as elliptical rims, rather than the more linear edge typical of DXA. Figure

8.1 shows a typical lumbar radiograph, with some contrast enhancement to ensure

all vertebrae are simultaneously visible. Figure 3.6 in Chapter 3 shows an example

of severe apparent tilting.

8.2 Materials and Methods

8.2.1 Data

The images used were obtained from anonymised radiographs collected in a previous

epidemiological study [23], with the permission of Professor Cyrus Cooper. We have

thoracic and lumbar radiographs, but have initially just used lumbar radiographs

as these are the more straightforward case due to less clutter (e.g. from lungs and

ribs), and a lower fracture prevalence. The dataset consisted of 250 lumbar radio-

Figure 8.1: Lumbar radiograph. a) shows the raw image (contrast enhanced); b) shows the automatically located vertebral contours superimposed.

graphs, digitised using a Vidar ∗ Diagnostic Pro Advantage digitiser at 300dpi and

12 bit intensity resolution. This Vidar digitiser allows a variety of analogue to digital

conversion mappings. As there is typically a large range of brightness/contrast at

different vertebral levels in the radiographs it was important to select a transform

that preserved information across a large dynamic range. The default logarithmic

transform did not work well on these images, as it typically “washed out” the often

brighter vertebrae in the lower lumbar, whereas using a more nearly linear transform

had the opposite effect of losing information in the typically darker upper portion

(T12/L1). After some initial experimentation it appeared that the “power 3” † option

gave the best compromise performance.

The digitised images were manually annotated using an in-house tool‡, by an

experienced radiographer, supervised by the author and with some advice from JEA. Each

vertebral contour uses 60 points around the vertebral body with 8 further points

around the pedicles. The endplate rims were modelled using a quasi-elliptical shape,

rather than the single edge previously used for DXA images. No images were included

∗Vidar Systems Corp, Herndon VA, USA

†manufacturer’s designation, in fact it appears to be a cube root

‡written by the author in C++ using the Trolltech Qt GUI library, existing AAM code and using bootstrapped AAM submodels, see section 6.3

Figure 8.2: Zoomed in view of L3 showing its shape model points

where the projectively induced tilting was so severe that lumbar vertebrae appeared

to interpenetrate each other (with the occasional exception of the extreme T12/L1 or

L4/L5 pairs). Such images are extremely difficult to read, even by an expert radiol-

ogist, and can lead to unreliable diagnosis. Extreme projective tilting can be caused

by setup error, patient positioning, or can be the result of an intrinsic scoliosis. Figure

8.2 shows a zoomed in view of L3 with its shape model points displayed.

8.2.2 AAM approach

The dynamic linked AAM approach of chapter 6 was used to fit a sequence of three

AAMs composed of overlapping vertebral triplets covering the spine from L4 up to

T12. Note that L5 is not normally used in vertebral fracture assessment as it is very

rare for L5 to suffer osteoporotic fracture, and it may be obscured by the iliac crest.

The three triplet models used were T12/L1/L2, L1/L2/L3 and L2/L3/L4. A slight

variation from the DXA method was that affine transforms rather than similarity

transforms were used for the shape model pose. Allowing anisotropic scaling (i.e. different x

and y scaling) and shearing is a better approximation to the perspective distortion induced by the

fan beam than assuming isotropic scaling. Each triplet sub-model has its own affine

pose parameters, thus allowing for variation in projective effects across the image.

Figure 8.3 shows the variation in the first two shape modes of the L2-centred triplet.
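
The difference between the two pose models can be made concrete with a short sketch. A similarity transform has four degrees of freedom and an affine transform has six, so only the latter can scale x and y differently. The function names and parameter values below are hypothetical, chosen for illustration.

```python
import math

def similarity_transform(points, scale, angle_rad, tx, ty):
    """Isotropic scale, rotation and translation (4 degrees of freedom)."""
    c = scale * math.cos(angle_rad)
    s = scale * math.sin(angle_rad)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def affine_transform(points, a, b, c, d, tx, ty):
    """General 2D affine map (6 degrees of freedom):
    x' = a*x + b*y + tx,  y' = c*x + d*y + ty."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# Different x and y scaling: representable as affine, not as a similarity.
stretched = affine_transform(square, 1.2, 0.0, 0.0, 0.9, 0.0, 0.0)
print(stretched)
```

Every similarity transform is also an affine one (with a = d and b = -c), so moving to an affine pose only adds freedom to the model.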

Figure 8.3: L2 triplet 3SD variation in first (left) and second (right) shape modes

T12 was included to form the uppermost L1-centred triplet, although there were

some lumbar radiographs in which T12 was not fully visible. Nevertheless in general

T12 should be visible on a lumbar radiograph, and results from DXA lead us to

believe that it is helpful in fitting L1 to also include the neighbouring T12 in the

model. In fact sometimes T12 is better visualised on the lumbar radiograph than on

the thoracic. There was often a high variation in brightness and contrast across the

different vertebral levels. For example T12 or even L1 were often very dark, and could

often not be seen without some local contrast optimisation, whereas L4 typically had

an over-bright “washed-out” appearance. Figure 8.1 is typical in this respect (though

the displayed figure is after contrast enhancement - L1 would be barely visible on the

original). Another advantage of decomposing the overall shape into sub-structures

is that the texture normalisation can be better tuned to the local brightness and

contrast, where there is substantial variation in these across the image.

As there is little useful information inside the vertebral body we used profile samplers

for the AAM texture model, rather than the triangulated region samplers classically

used with an AAM. We have already established that profile samplers work better

on DXA images (see chapter 6). The profile samplers used were similar to those

used in DXA images, but due to the better resolution of radiographs they included

additional scales. The samplers extracted the gradient perpendicular to the local

shape, and this was non-linearly renormalised using a sigmoidal function tuned to

the mean absolute gradient [32, 123] over the entire profile set. We used a 4-level

multi-resolution pyramid search, to extend the convergence zone, with 8 samples

either side of the shape. The finest level step size was 0.375mm, and the images were

pre-smoothed up to a resolution of 0.1694 mm per pixel (i.e. one level of Gaussian

pyramid up-smoothing). Thus the profile step size represents about 2 pixels (at each

level of the pyramid). The extracted gradient is Gaussian smoothed across the local

tangent, with a smoothing window equal to the step length (on each side). We also

experimented with a profile sampler which concatenated this profile with a similar

profile sampler extracting a measure of image corner strength (“cornerness”), related

to the Harris corner detector [73], as in [123]. As the corners of the vertebrae are of

physical interest in standard morphometry, it was thought that including a cornerness

measure in the AAM might improve the accuracy at points of important diagnostic

interest. Furthermore the projective parallax and oblique beam orientation tend to

introduce curved features in the region of the profile. The cornerness measure has

the further advantage that it implicitly includes feature information from a somewhat

larger region, as the measure is based on the structure tensor ($\nabla I \, \nabla I^{T}$) §, which is

Gaussian smoothed over a locally square region with semi-width twice the profile step

length. See [123] for details.
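
To illustrate the sigmoidal renormalisation of the profile gradients, a minimal sketch is given here. It assumes the common form g / (|g| + g0), with g0 the mean absolute gradient over the profile set; the exact function used is defined in [32, 123], and the function name and sample values below are hypothetical.

```python
def normalise_gradients(gradients):
    """Map raw gradients into (-1, 1) with an assumed sigmoid g / (|g| + g0).

    g0 is the mean absolute gradient over the supplied profile samples, so
    a gradient of typical magnitude maps to about +/-0.5 and strong edges
    saturate towards +/-1.
    """
    g0 = sum(abs(g) for g in gradients) / len(gradients)
    if g0 == 0.0:
        return [0.0 for _ in gradients]  # flat profile, nothing to normalise
    return [g / (abs(g) + g0) for g in gradients]

# Hypothetical gradient samples along one profile.
print(normalise_gradients([1.0, -1.0, 2.0, -4.0]))
```

A normalisation of this kind reduces the influence of the large brightness and contrast differences between vertebral levels, since only the gradient relative to its local typical magnitude is retained.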

8.2.3 Experiments

Leave-25-out tests were performed over the 250 images. As AAMs perform local

search an approximate initialisation somewhere in the vicinity of the vertebrae is

needed. The initialisation was performed as for DXA (method 2) on the approximate

centres of the vertebrae. We simulated the precision of a clinician clicking on the

centres of each vertebra by adding zero-mean Gaussian errors with SD of 2mm in

the y-direction (along the spine) and 3mm in the x-direction (as for DXA). Twenty

replications (i.e. random initialisations) of each image were performed.
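
The simulated initialisation can be sketched as follows. The standard deviations and the number of replications are those stated above; the function name, the seed and the example centre coordinates are hypothetical.

```python
import random

def perturb_centres(centres_mm, sd_x=3.0, sd_y=2.0, n_replications=20, seed=0):
    """Add zero-mean Gaussian errors to approximate vertebral centres.

    centres_mm: list of (x, y) centre positions in mm, y running along the spine.
    Returns n_replications perturbed copies of the centre list.
    """
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    replications = []
    for _ in range(n_replications):
        replications.append(
            [(x + rng.gauss(0.0, sd_x), y + rng.gauss(0.0, sd_y))
             for x, y in centres_mm]
        )
    return replications

# Hypothetical centres for T12 to L4, for illustration only.
centres = [(50.0, 20.0), (52.0, 55.0), (53.0, 92.0), (52.0, 130.0), (50.0, 168.0)]
initialisations = perturb_centres(centres)
print(len(initialisations), len(initialisations[0]))
```

Each replication is then used to start a separate search, so that the reported accuracy reflects the expected variability of a clinician's clicks.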

8.3 Results

The accuracy of the search was characterised by calculating the absolute point-to-line

distance error for each point on the vertebral body. Table 8.1 compares results for

the two profile samplers used with the data separated into points within normal or

fractured vertebrae. Each row gives the mean, median and 75th percentiles, and the

percentage of point errors in excess of 2mm. The threshold of 2mm would be around

2.5 SDs of manual precision, and can be viewed as a point failure indicator.
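
A minimal sketch of this error measure is given below, assuming that the point-to-line distance is taken from each located point to the nearest segment of the piecewise-linear manual annotation. The function names are hypothetical, and the percentile convention (inclusive interpolation) is an assumption.

```python
import math
import statistics

def point_to_segment(p, a, b):
    """Distance from point p to the line segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project p onto the line and clamp the foot point to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def point_to_contour(p, contour):
    """Smallest distance from p to an open polyline given as a list of points."""
    return min(point_to_segment(p, contour[i], contour[i + 1])
               for i in range(len(contour) - 1))

def summarise(errors_mm, failure_threshold_mm=2.0):
    """Mean, median, 75th percentile and percentage of errors over the threshold."""
    failures = sum(1 for e in errors_mm if e > failure_threshold_mm)
    return {
        "mean": statistics.mean(errors_mm),
        "median": statistics.median(errors_mm),
        "p75": statistics.quantiles(errors_mm, n=4, method="inclusive")[2],
        "pct_over": 100.0 * failures / len(errors_mm),
    }
```

For example, `summarise([point_to_contour(p, annotation) for p in found_points])` would give one row of an accuracy table, where `annotation` and `found_points` are hypothetical lists of (x, y) positions in mm.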

Table 8.1 shows that the results are worse for fractured than normal vertebrae.

§with the Cartesian image gradient ∇I as a column vector

                    Gradient Only Sampler       Gradient & Corner Sampler
Accuracy            Normal       Fractured      Normal       Fractured
Statistic           Vertebrae    Vertebrae      Vertebrae    Vertebrae
Mean (mm)           0.71         1.11           0.64         1.06
Median (mm)         0.46         0.62           0.43         0.61
75%-ile (mm)        0.89         1.34           0.82         1.32
%ge errors > 2mm    6.2%         14.1%          4.6%         13.3%

Table 8.1: Search Accuracy Percentiles by Fracture Status for the two profile samplers used

A more detailed examination by fracture grade gives mean accuracies of 0.84mm,

1.79mm and 3.35mm for fracture grades 1, 2 and 3 ¶ respectively (for the gradient

and corner sampler). However there was a low fracture prevalence in the lumbar

region in the sample, and these figures are based on 17 grade 1 fractured vertebrae,

two grade 2, and only a single grade 3 fracture.

The more sophisticated profile sampler including a corner measure appears to produce

a small improvement in accuracy of around 0.07mm. This represents a 10% reduction

in mean error. We confirmed that this difference is statistically significant at the 1%

level by calculating a 99% confidence interval for the mean difference between the two

samplers, using hierarchical bootstrap resampling of the differences in errors between

the two profiles as in [122]. This symmetric (in probability) 99% bootstrapped

confidence interval on the mean difference was [0.048,0.082]. As this interval does

not span zero, the difference is significant at the 1% level.
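
The bootstrap procedure can be sketched as follows. This is an illustration only, assuming a two-level scheme in which images are resampled with replacement and then point errors are resampled within each chosen image; the scheme actually used is described in [122], and the function name and example differences are hypothetical.

```python
import random

def hierarchical_bootstrap_ci(diffs_by_image, n_boot=2000, alpha=0.01, seed=0):
    """Percentile confidence interval for the mean paired difference.

    diffs_by_image: one list per image of per-point error differences
    (sampler A minus sampler B). Returns (lower, upper) bounds of the
    (1 - alpha) interval.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        pooled = []
        for _ in diffs_by_image:
            image = rng.choice(diffs_by_image)                # resample an image
            pooled.extend(rng.choice(image) for _ in image)   # resample its points
        means.append(sum(pooled) / len(pooled))
    means.sort()
    lower = means[int((alpha / 2.0) * n_boot)]
    upper = means[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return lower, upper

# Hypothetical differences in mm for three images, for illustration only.
diffs = [[0.05, 0.06, 0.08], [0.07, 0.07], [0.04, 0.09, 0.06, 0.07]]
low, high = hierarchical_bootstrap_ci(diffs, n_boot=500)
print(low > 0.0)
```

Resampling at the image level respects the fact that point errors within one image are correlated, which a simple point-level bootstrap would ignore.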

8.4 Discussion

8.4.1 Overall Accuracy Performance

The mean segmentation accuracy of 0.64mm on normal vertebrae is comparable to

manual precision in point placement, and to our previous results on DXA images [117].

Over 95% of points in normal vertebrae are located to within 2mm of the manually

annotated outline. However the dataset contained a very low prevalence of fractured

vertebrae, and so the shape models are evidently undertrained for fractures above

¶i.e. mild, moderate and severe fractures, see [65]

grade 1. Therefore within the limitations of the small fractured sample it appears that

the results deteriorate with increasing fracture grade. However given our previous

reasonable accuracy achieved on fractured vertebrae with DXA images, we believe

that this problem could be solved by adding more fractured training examples. The

mean accuracy is better than other comparable cited figures in the literature [77, 18].

For example de Bruijne et al [18] obtained a mean point-to-contour accuracy of 1.4mm

on lumbar radiographs using shape particle filtering, which is more than double the

size of error achieved by our AAM approach. On the other hand this was for a fully

automatic search with no approximate manual initialisation such as we use. Howe

et al [77] state that 68% of points on lumbar radiographs were located to within 25

pixels. We understand the dataset used had a resolution of 0.174mm per pixel, so

this is equivalent to a 68th percentile of 4.35mm, clearly substantially worse than

our 75th percentile of 0.85mm. However again Howe et al were using a completely

automatic method, with the AAM being initialised to the best template match found

by an initial Generalised Hough Transform.

8.4.2 Conclusion

In conclusion the results confirm the feasibility of substantially automating vertebral

segmentation on radiographs, although the shape models need better training on

fractured vertebrae. Within the limitations of the dataset, the projective effects of

spinal radiography do not appear to present any substantial problem to an AAM-

based approach.

8.4.3 Future Work

Further fractured training examples are clearly required, and we also intend to extend

the work to the thoracic spine, which tends to contain more osteoporotic fractures.

A dataset is currently being annotated by an experienced radiographer.

Use of the shape and appearance parameters of the fitted models could in future

provide a means of classifying vertebrae as normal, fractured, or otherwise deformed,

as we have already demonstrated for DXA data. Current simplistic quantitative mor-

phometric methods are unreliable, especially for mild fractures, but the appearance

parameters may provide a quantified form of some of the more subtle aspects of visual

or semi-quantitative expert reading of vertebral fractures. We therefore view obtain-

ing a reliable automatic segmentation as the first step in achieving a Computer Aided

Diagnosis (CAD) system for the diagnosis of vertebral fracture from radiographs.

Chapter 9

Conclusions and Further Work

This chapter summarises the work described in this thesis, highlighting its novel

contributions and its potential for real clinical use. Areas of future development are

also summarised.

9.1 Summary of Original Work and Results

9.1.1 AAM methodological developments

We have presented a general algorithm for combining multiple AAM sub-models. Us-

ing multiple sub-models can mitigate the undertraining problem that can be inherent

in statistical models, and also allows for a greater range of pathological cases, or local

illumination/contrast effects. We envisaged that these sub-models could typically be

overlapped to provide linkage; although we also allow for the case where the only link-

age is provided by a global model, which is used to re-predict initial search locations

for latterly-fitted sub-models, given the latest set of already-fitted sub-models. We

also developed a technique of dynamically sequencing the sub-models, thus allowing

the fitting order to be determined by the data. This was shown to lead to modest

but significant improvements in accuracy on the DXA dataset. It also facilitates

generalisation of our method to cases where there is no natural fitting order.

The dynamic sequencing approach also lends itself to allowing multiple initialisations

of each sub-model in order to improve robustness in locating pathological cases.

9.1.2 Vertebral Segmentation

We have collated a large database of DXA images, and the dataset has been enriched

with a high fracture prevalence. This has allowed testing of the AAM segmentation

techniques under realistic operating conditions, such as would be encountered in

clinical use. The performance of the segmentation techniques has been extensively

verified against the full range of fracture grades, and good accuracy is maintained

even against severe fractures, although inevitably there are a modest number of search

failures, which increases with fracture grade. Good performance can be maintained

by using an alternative “fractured” initialisation of each AAM sub-model.

We have evaluated an extensive set of alternative AAMs. We found that profile AAMs

performed better than the classical triangulated region AAM, even when the more

sophisticated feature-AAM of Scott et al [123] was used for the latter. We optimised

the sub-model structure used, finding that groups of 5 vertebrae (quintets) were

marginally superior, though the performance gain over triplets was small, and largely

confined to severe fractures. We found that single vertebra models were unreliable.

At the other extreme we confirmed that a single global model was also unable to

accurately segment fractured cases, though with normal vertebrae its performance

was more similar to the sub-model approach. This somewhat contradicted our earlier

result obtained on a smaller training set [113]. It appears that with a small training

set there is more gain to be obtained from decomposing the structure into multiple

sub-models, but this accuracy difference is gradually eroded as the training set is

extended; but only for normal cases. The global AAM still maintains too great an

a priori bias towards the mean shape to cope with pathologies or unusual sub-shapes.

We therefore recommend our sub-model approach.

We have obtained a location accuracy that is comparable to manual precision of point

placement for osteoporotic patients in around 90% of cases.

We have also applied our methods to a dataset of 250 digitised lumbar radiographs,

with similarly good results.

9.1.3 Vertebral Classification

We investigated the use of both shape model and appearance model parameters in dis-

tinguishing vertebral fractures from both normal vertebrae, and other short vertebral

height deformities. We used linear classifiers, and optimised the proportion of texture

variance retained in the model. We compared our classifiers with methods based on

just the standard 3 vertebral heights, and demonstrated a convincing improvement

in specificity for given sensitivity. The false positive rate (per vertebra) at 95% sen-

sitivity is around 5% using an appearance classifier. Thus at 95% sensitivity we have

roughly halved the false positive rate for lumbar vertebrae compared with traditional

quantitative morphometric methods; and for thoracic vertebrae the reduction is ap-

proximately four-fold. This is a substantial improvement in the clinical applicability

of quantitative techniques. There had been a view among many radiologists that

quantitative methods were too unreliable to be of much clinical use. However we

believe that, because our appearance classifiers can capture at least some of the more

subtle distinguishing features used in expert evaluation, a reliable Computer Assisted

Diagnosis method is possible. This will be particularly valuable in situations where

DXA scans are conducted in units other than radiology departments. Furthermore

by using measures based on the distance in the appearance parameter hyperspace

from the classification boundary, we will be able to detect more subtle longitudinal

changes in patients. This should have applicability to a wide range of longitudinal

clinical trials.

We also evaluated a combined semi-automatic segmentation and classification, and

confirmed that with a semi-automatic segmentation at 5% FPR we can achieve 75%

sensitivity for grade 1 fractures, and over 90% on grades 2 and 3 combined. A modest

degree of user-correction to the segmentation should allow even better performance.

At the overall patient level (combining all vertebrae from L4-T7) this results in a

patient diagnosis sensitivity of 95% with 75% specificity; alternatively reducing the

individual vertebra FPR to 2.5% gave an overall patient specificity of over 90% with

similar sensitivity (10% equal error rate). These patient-level figures partly reflect

the fracture-enriched nature of our dataset.
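As a rough point of reference for how a per-vertebra false positive rate compounds over the ten vertebrae from L4 to T7, the sketch below computes the patient-level specificity that would result if a patient were called positive whenever any single vertebra was flagged, and if false positives were statistically independent between vertebrae. Both conditions are assumptions made here purely for illustration (as is the function name); they are not taken from the evaluation described above.

```python
def patient_specificity(per_vertebra_fpr: float, n_vertebrae: int) -> float:
    """Probability that none of n_vertebrae unfractured vertebrae is flagged.

    Illustrative assumptions (not from the thesis): a patient is positive if
    ANY vertebra is flagged, and false positives are independent.
    """
    if not 0.0 <= per_vertebra_fpr <= 1.0:
        raise ValueError("per_vertebra_fpr must lie in [0, 1]")
    return (1.0 - per_vertebra_fpr) ** n_vertebrae


N_VERTEBRAE = 10  # L4, L3, L2, L1, T12, T11, T10, T9, T8, T7

if __name__ == "__main__":
    for fpr in (0.05, 0.025):
        spec = patient_specificity(fpr, N_VERTEBRAE)
        print(f"per-vertebra FPR {fpr:.3f} -> independent-case specificity {spec:.3f}")
```

Under these assumptions the baseline is roughly 0.60 at 5% FPR and 0.78 at 2.5% FPR. The reported patient-level specificities (75% and over 90%) are higher, which is what one would expect if false positives tend to cluster within the same patients rather than occurring independently.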

9.2 Future Work

9.2.1 Other modalities

We have mostly used DXA images in this work. However the “gold standard” for

fracture evaluation in both radiology departments and clinical trials is still the spinal

radiograph (or computed radiography). We have collated an extended set of both

thoracic and lumbar radiographs and intend to evaluate both our segmentation and

classification techniques on these.

Because of its shorter scan time, single-energy (SXA) vertebral fracture evaluation is
more common than DXA, even though the mid-thoracic vertebrae are less well
visualised because of soft tissue and diaphragm motion artefacts. We therefore
intend to also evaluate our techniques in single-energy SXA mode.

9.2.2 Classifier improvements

We have only evaluated classifiers using a single vertebra. There could be some gain

in performance by using larger models, such as a triplet or quintet. By including the

neighbours in the model, some further shape information would be available. This

might be helpful for several reasons. Firstly if no shape information on neighbour lo-

cation is included, it can be difficult to distinguish the additional edge texture caused

by sampling into the converse edge of a neighbour from the cortical rim remnant of

a fractured endplate. Secondly the neighbour location allows better definition of the

inter-vertebral disc space, and the texture therein can be useful in differential diagno-

sis. Thirdly there may be some subtle conditional shape effects that are useful. Also

certain kinds of vertebral deformities tend to affect several neighbouring vertebrae

(e.g. mild wedging due to spondylosis).

However by increasing the number of parameters input to the classifier, there would

be a danger of causing training problems. If some spurious correlations are present

in the training set, the classifier may even generalise less well and actually perform

worse than a single vertebra one. We would therefore also investigate alternative

training methods, specifically the Support Vector Machine [136]. For purely linear

classifiers the SVM training method places no weight on instances that are far from

the boundary (e.g. severe fractures or clearly normal cases), instead being based on a

more parsimonious set of “Support Vectors” near the boundary. Also by using kernel

methods, such as a Radial Basis Function kernel, the parameter input space can be

projected into a higher dimensional space in which better linear separation can be

achieved. This allows non-linear relations in the original space to be modelled.
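The effect of a Radial Basis Function kernel can be illustrated with a minimal sketch. It evaluates the standard kernel SVM decision function, a weighted sum of kernel values over the support vectors plus a bias, on an XOR-like arrangement that no linear boundary in the original space can separate. The support vectors, coefficients, bias and gamma below are hand-picked for illustration rather than obtained by training, and the function names are hypothetical.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Radial Basis Function kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def svm_decision(x, support_vectors, signed_coefs, bias, gamma):
    """Kernel SVM decision value: sum_i coef_i * K(sv_i, x) + bias.

    signed_coefs[i] stands for (dual weight * class label) of support vector i.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, signed_coefs)) + bias


# XOR-like layout: opposite corners share a class (hand-picked, not trained).
SUPPORT_VECTORS = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
SIGNED_COEFS = [1.0, 1.0, -1.0, -1.0]

if __name__ == "__main__":
    for point in SUPPORT_VECTORS:
        value = svm_decision(point, SUPPORT_VECTORS, SIGNED_COEFS, 0.0, 1.0)
        print(point, "positive" if value > 0 else "negative")
```

Only the points listed as support vectors contribute to the decision value, which is the sense in which training instances far from the boundary carry no weight.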

The texture model used to build the appearance model implicit in the classifiers may

still not be optimal. We intend to investigate use of multi-scale texture samplers

(e.g. Gaussian smoothed derivatives at several scales), as well as the inclusion of

other feature detectors, such as the edge/corner AAM samplers developed by Scott

et al [123].

9.2.3 Automatic Detection of Search Failure

There will inevitably be some search failures, especially in cases of severe osteoporosis

with multiple grade 3 fractures. Ideally the segmentation system should be able to

evaluate its own performance, and detect when a search failure is likely. This might

even allow a degree of self-recovery. For example on failure a more extensive set of

alternative AAM initialisations might be tried. The likely failure could be highlighted

to the user in a clinical system, and the vertebra’s classification changed to unknown.

An obvious measure to try is the residual sum of squares. However, because

the images are so noisy and often have low values of signal, it can be difficult to

distinguish between a correct segmentation of noisy data, and an incorrect fit to

the background; or a partially correct fit with some confusion with a neighbouring

vertebra’s edge. Our work on using a quality of fit measure in the dynamic AAM

sequencing method has also established that the residual sum of squares is far from

being χ²-distributed. We anticipate that in true failures, there are likely to be sets

of residuals that show strong spatial correlation. We therefore intend to develop

methods that use the spatial structure of the residuals, as well as their total sum of

squares, to assess whether a search has been successful or not.
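The distinction between total residual energy and residual spatial structure can be sketched with a simple statistic. The example below computes the residual sum of squares together with the lag-1 autocorrelation of a sequence of residuals, assuming the residuals are ordered along a sampling profile. The lag-1 autocorrelation is only one possible choice of spatial statistic, chosen here for illustration; it is not a method proposed or evaluated in this work, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def residual_diagnostics(residuals: Sequence[float]) -> Tuple[float, float]:
    """Return (residual sum of squares, lag-1 autocorrelation).

    Assumes residuals are listed in spatial order (e.g. along one profile).
    The autocorrelation is 0.0 by convention if the residuals are constant.
    """
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    rss = sum(r * r for r in residuals)
    mean = sum(residuals) / n
    centred = [r - mean for r in residuals]
    denom = sum(c * c for c in centred)
    if denom == 0.0:
        return rss, 0.0
    lag1 = sum(centred[i] * centred[i + 1] for i in range(n - 1)) / denom
    return rss, lag1


if __name__ == "__main__":
    noise_like = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips at every sample
    structured = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # one coherent block per side
    print(residual_diagnostics(noise_like))
    print(residual_diagnostics(structured))
```

The two sequences have an identical sum of squares, yet the second has a clearly positive lag-1 autocorrelation, which is the kind of spatially coherent residual pattern that the sum of squares alone cannot reveal.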

9.3 Final Statement

We have successfully demonstrated the use of AAM-based methods to locate and

classify vertebral fractures due to osteoporosis. The segmentation accuracy we have

achieved approaches that of human precision. The specificity of current quantitative

fracture diagnosis techniques has been substantially improved. This will enable real

improvements in early diagnosis of osteoporosis, which could have a significant effect

on the health of millions of patients. In parallel with this work we have already

developed a prototype clinical tool utilising these techniques, and hope that our

methods become adopted by clinicians.

Appendix A

A.1 Weighted fitting of shape and appearance model parameters

It is necessary to compute the shape model parameters bs with respect to a weighted

vector of points X, where the weight of each point indicates its relative importance.

Often these weights w will be the reciprocals of the estimated point error variances.

As well as calculating model parameters it is generally necessary to calculate an

alignment transform T_t with pose parameters t to approximately align the shape to

the mean model shape x̄; and also to apply the shape model constraints. As there

may be some interaction between these 3 stages, in general several iterations around

a loop are performed (e.g. 5), as defined in Algorithm 4.

In general the point weights need not be isotropic; indeed, the constrained AAM
search assumes anisotropic weights in general. However, the higher-level APIs of the
existing C++ code that we used impose a limitation. This code uses a class hierarchy
set up for general abstract active models (i.e. a generalisation of the AAM or ASM),
and assumes isotropic weights when performing weighted fits of the shape model to a
set of target points. As we only use these functions to provide an approximate
initialisation to the AAMs, we have not modified the existing APIs. Instead we
renormalise the 2n-dimensional weights vector w to an n-dimensional isotropic form w′.
Assuming for convenience that the x Cartesian coordinates occupy the first n elements
of the representation, and the y Cartesian coordinates occupy the second block of n
elements, we use the mapping:

$$w'_j = 0.5\,(w_j + w_{j+n}); \quad j \in \{1 \ldots n\} \tag{A.1}$$

Then let W be a 2n-dimensional square diagonal matrix, defined as:

$$[W]_{j,j} = w'_j; \quad j \in \{1 \ldots n\} \tag{A.2}$$

and

$$[W]_{j,j} = w'_{j-n}; \quad j \in \{n+1 \ldots 2n\} \tag{A.3}$$
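The renormalisation in equations A.1 to A.3 can be sketched in a few lines. This is an illustration only, not the C++ code referred to above: the function name is hypothetical, indices are 0-based, and the diagonal of W is returned as a flat list rather than a full matrix.

```python
from typing import List, Sequence, Tuple


def isotropic_weights(w: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Collapse a 2n-vector of per-coordinate weights to isotropic form.

    w holds the n x-coordinate weights followed by the n y-coordinate weights.
    Returns (w_iso, w_diag): w_iso[j] is the mean of the x and y weights of
    point j (A.1), and w_diag is the 2n-long diagonal of W (A.2, A.3).
    """
    if len(w) % 2 != 0:
        raise ValueError("weights vector must have even length 2n")
    n = len(w) // 2
    w_iso = [0.5 * (w[j] + w[j + n]) for j in range(n)]
    w_diag = w_iso + w_iso  # same weight applied to the x and y of each point
    return w_iso, w_diag


if __name__ == "__main__":
    # Two points: x-weights (1, 2), y-weights (3, 5).
    print(isotropic_weights([1.0, 2.0, 3.0, 5.0]))
```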

Algorithm 4 Weighted fitting of shape model to target world-frame points

1. $b = 0$

2. counter $= 0$

3. While counter < Max Iterations do

(a) $x_m = \bar{x} + P_s b$

(b) Calculate best transform parameters $t$ so that $X \cong T_t(x_m)$. See section A.2

(c) $x_p = T_t^{-1}(X)$

(d) $b = \arg\min_b \, (x_p - \bar{x} - P_s b)^T W (x_p - \bar{x} - P_s b)$. See section A.3

(e) If $\sum_{i=1}^{m} b_i^2 / \lambda_i > D_{max}$ then

i. Shift $b$ to the nearest point $b'$ on the hyperellipsoid so that $\sum_{i=1}^{m} {b'_i}^2 / \lambda_i = D_{max}$

(f) increment counter
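Step (e) of Algorithm 4 can be sketched as follows. Note the substitution: finding the truly nearest point on a hyperellipsoid requires an iterative solve, so this sketch uses uniform radial rescaling of b instead. That lands exactly on the hyperellipsoid but is not, in general, the nearest point. The function name is hypothetical, and D_max is treated as a bound on the squared Mahalanobis distance, as in the inequality of step (e).

```python
import math
from typing import List, Sequence


def clamp_to_hyperellipsoid(b: Sequence[float],
                            eigenvalues: Sequence[float],
                            d_max: float) -> List[float]:
    """Limit shape parameters so that sum_i b_i^2 / lambda_i <= d_max.

    Uses radial rescaling (a simplification; not the nearest-point projection).
    """
    d2 = sum(bi * bi / lam for bi, lam in zip(b, eigenvalues))
    if d2 <= d_max:
        return list(b)  # already inside the allowed region
    scale = math.sqrt(d_max / d2)
    return [bi * scale for bi in b]


if __name__ == "__main__":
    # Squared Mahalanobis distance is 9/1 + 16/4 = 13, above the limit of 3.25.
    print(clamp_to_hyperellipsoid([3.0, 4.0], [1.0, 4.0], 3.25))
```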

A.2 Optimal pose parameters

We assume two sets of n points {xi : 1 ≤ i ≤ n} and {x′i : 1 ≤ i ≤ n}. We seek to

align the shape represented by {xi} to that represented by {x′i} so that the weighted

square error norm is minimised. Note in this section a shape is represented by the set

of 2D points, where xi is the 2D ith point, in contrast to the usual notation of x as a

2n dimensional vector representing all the points. We represent the weightings by a

set of weight matrices Wi, where Wi = wiI2; so in other words the point weightings

are isotropic as discussed above. A treatment of the more general anisotropic case is

given by Cootes in [34]. The assumption of isotropic weightings allows us to make

some simplifications.

We seek a transformation Tt with parameters t so as to minimise:

$$E = \sum_{i=1}^{n} (x'_i - T_t(x_i))^T \, W_i \, (x'_i - T_t(x_i)) \tag{A.4}$$

Solutions may be obtained in general by setting $\partial E / \partial t = 0$.

For a similarity transform, with translation, scaling and rotation we can combine the

rotation and scaling factors into only two parameters a, b, together with a translation

shift $d = (t_x, t_y)^T$, so that, defining the rotation/scaling matrix $S$:

$$S = \begin{pmatrix} a & -b \\ b & a \end{pmatrix} \tag{A.5}$$

we have:

$$T_t(x) = Sx + d \tag{A.6}$$

and $t = (a, b, t_x, t_y)^T$.

First we define the following summation forms.

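$$\begin{aligned}
S_{xWx} &= \sum x_i^T W_i x_i, & S_{Wx} &= \sum W_i x_i, & S_W &= \sum W_i, \\
S_{xWx'} &= \sum x_i^T W_i x'_i, & S_{Wx'} &= \sum W_i x'_i
\end{aligned} \tag{A.7}$$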

We also define the rotation matrix J which transforms x into its normal, i.e.

$$J = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \tag{A.8}$$

We define the further summation forms:

$$S_{xWJx'} = \sum x_i^T W_i J x'_i, \qquad S_{WJx} = \sum W_i J x_i \tag{A.9}$$

The pose parameters are then given by the solution to the equation:

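$$\begin{pmatrix}
S_{xWx} & 0 & S_{Wx}^T \\
0 & S_{xWx} & S_{WJx}^T \\
S_{Wx} & S_{WJx} & S_W
\end{pmatrix}
\begin{pmatrix} a \\ b \\ t_x \\ t_y \end{pmatrix}
=
\begin{pmatrix} S_{xWx'} \\ S_{xWJx'} \\ S_{Wx'} \end{pmatrix} \tag{A.10}$$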
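A minimal sketch of the isotropic-weight pose fit follows. Rather than assembling the block system of equation A.10, it minimises the same error E of equation A.4 by first eliminating the translation through the weighted centroids and then solving for a and b in closed form. The function name and the tuple-based point representation are choices made here for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def fit_similarity(src: Sequence[Point], dst: Sequence[Point],
                   w: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (a, b, tx, ty) minimising sum_i w_i * |dst_i - (S src_i + d)|^2,
    where S = [[a, -b], [b, a]] and d = (tx, ty)."""
    total = sum(w)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    # Weighted centroids of the source and target shapes.
    cx = sum(wi * p[0] for wi, p in zip(w, src)) / total
    cy = sum(wi * p[1] for wi, p in zip(w, src)) / total
    qx = sum(wi * p[0] for wi, p in zip(w, dst)) / total
    qy = sum(wi * p[1] for wi, p in zip(w, dst)) / total
    s_uu = s_dot = s_cross = 0.0
    for wi, (x, y), (xd, yd) in zip(w, src, dst):
        ux, uy = x - cx, y - cy        # centred source point
        vx, vy = xd - qx, yd - qy      # centred target point
        s_uu += wi * (ux * ux + uy * uy)
        s_dot += wi * (ux * vx + uy * vy)
        s_cross += wi * (ux * vy - uy * vx)
    if s_uu == 0.0:
        raise ValueError("source points are degenerate")
    a = s_dot / s_uu
    b = s_cross / s_uu
    # Translation maps the transformed source centroid onto the target centroid.
    tx = qx - (a * cx - b * cy)
    ty = qy - (b * cx + a * cy)
    return a, b, tx, ty
```

For example, if `dst` is produced by applying a known similarity transform to `src`, the function recovers that transform's parameters for any choice of positive weights, since the fit is then exact.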

A.3 Weighted fitting of shape model parameters

Having aligned a world shape to the model frame we next seek optimal shape model

parameters to minimise the weighted L2 norm:

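$$(x_p - \bar{x} - P_s b)^T W (x_p - \bar{x} - P_s b) \tag{A.11}$$

Define $d = x_p - \bar{x} - P_s b$. Differentiating $d^T W d$ with respect to $b$ and
setting the derivatives to zero for the minimum leads to the equation:

$$P_s^T W P_s b = P_s^T W (x_p - \bar{x}) \tag{A.12}$$

Define $A = P_s^T W P_s$. Equation A.12 is of the form

$$Ab = z \quad \text{with} \quad z = P_s^T W (x_p - \bar{x}) \tag{A.13}$$

Also $A^T = P_s^T W^T P_s$. But as $W$ is diagonal, $W^T = W$. Hence also $A^T = A$.
Since $A$ is symmetric (and positive definite when all the weights are positive and
$P_s$ has full column rank), equation A.13 can be solved efficiently by Cholesky
decomposition, though in cases when the solution is ill-conditioned SVD can be used
instead.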
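The Cholesky route can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the implementation used in this work: the function name is hypothetical, P_s is passed as a list of 2n rows with m entries each, the diagonal of W is passed as a flat list, and a non-positive pivot is treated as the signal to fall back to another solver such as SVD.

```python
import math
from typing import List, Sequence


def weighted_shape_fit(P: Sequence[Sequence[float]], w: Sequence[float],
                       xp: Sequence[float], xbar: Sequence[float]) -> List[float]:
    """Solve (P^T W P) b = P^T W (xp - xbar) by Cholesky decomposition."""
    rows, m = len(P), len(P[0])
    r = [xp[k] - xbar[k] for k in range(rows)]
    # Normal-equation matrix A = P^T W P and right-hand side z = P^T W r.
    A = [[sum(w[k] * P[k][i] * P[k][j] for k in range(rows)) for j in range(m)]
         for i in range(m)]
    z = [sum(w[k] * P[k][i] * r[k] for k in range(rows)) for i in range(m)]
    # Cholesky factorisation A = L L^T (L lower triangular).
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("A is not positive definite; use SVD instead")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    # Forward substitution: L y = z.
    y = [0.0] * m
    for i in range(m):
        y[i] = (z[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T b = y.
    b = [0.0] * m
    for i in reversed(range(m)):
        b[i] = (y[i] - sum(L[k][i] * b[k] for k in range(i + 1, m))) / L[i][i]
    return b
```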

A.4 Applying additional appearance model constraints

When the sub-models are initialised, in addition to first performing a weighted fit of

the shape model to the required points, we also apply additional appearance model

constraints. This is done by assigning the combined (shape parameter, texture

parameter) vector ba (see equation 4.17) using the required shape model parameters,

and with the texture parameters set to zero (i.e. mean texture is assumed), but the
weights on the texture parameters are all set to zero. Fitting the appearance model
to the shape parameters is conceptually the same as the weighted fit of the
shape model to a set of points already discussed in the previous section. We require
a solution to

$$Q^T W_a Q c = Q^T W_a b_a \tag{A.14}$$

The weights matrix Wa is a diagonal matrix with the first ms elements along the

diagonal set to unity (i.e. shape parameters all have the same weight), and then the

remaining mt parameters are set to zero (i.e. we attach no importance to the texture

parameters).

The use of zero weights on the texture parameters is liable to make the system in
equation A.14 rank-deficient, so SVD should be used to solve it.
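The rank deficiency can be made explicit by partitioning Q by rows. The block names Q_s and Q_t, and the symbol b_s for the shape block of b_a (with whatever scaling equation 4.17 applies), are introduced here for illustration only. The system below is the set of normal equations for minimising $(b_a - Qc)^T W_a (b_a - Qc)$:

```latex
W_a = \begin{pmatrix} I_{m_s} & 0 \\ 0 & 0 \end{pmatrix}, \qquad
Q = \begin{pmatrix} Q_s \\ Q_t \end{pmatrix}, \qquad
b_a = \begin{pmatrix} b_s \\ 0 \end{pmatrix}
\quad\Longrightarrow\quad
Q^T W_a Q = Q_s^T Q_s, \qquad
Q^T W_a b_a = Q_s^T b_s .
```

The system therefore reduces to $Q_s^T Q_s \, c = Q_s^T b_s$. Because $Q_s$ has only $m_s$ rows, the matrix $Q_s^T Q_s$ has rank at most $m_s$ and is singular whenever the number of appearance parameters exceeds $m_s$. An SVD-based pseudo-inverse then gives the minimum-norm solution $c = Q_s^{+} b_s$.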

Finally the appearance parameter constraints on maximum Mahalanobis distance are

imposed, and if c is outside the allowed hyper-ellipsoid, then it is brought back to

the nearest point on the allowed hyper-ellipsoid. This may implicitly mean that the

shape is then altered slightly, as the actual shape model parameters that will then

finally be used are re-derived from c using the appearance model (i.e. the shape

model parameters are determined by the first ms elements of Qc).

Bibliography

[1] Adams JE. Dual-Energy X-ray absorptiometry. In: Baert A and Sartot K, eds., Radiology of Osteoporosis, (pages 87–100) (Springer-Verlag), 2003.

[2] Armstrong AL and Wallace WA. The epidemiology of hip fractures and methods of prevention. Acta. Orthop. Belg., 60(S1):85–101, 1994.

[3] Baker S and Matthews I. Equivalence and efficiency of image alignment algorithms. In: Computer Vision and Pattern Recognition Conference 2001, vol. 1, (pages 1090–1097). 2001.

[4] Bataur A and Hayes M. Adaptive active appearance models. IEEE Trans. Imaging Processing, 14:1707–1721, 2005.

[5] Bauer JS, Muller D, Ambekar A, Dobritz M, et al. Detection of osteoporotic vertebral fractures using multi-detector CT. Osteoporosis Int, 17:608–615, 2006.

[6] Beck TJ, Looker AC, Ruff CB, Sievanen H, et al. Structural trends in the aging femoral neck and proximal shaft: analysis of the third national health and nutrition examination survey dual-energy X-ray absorptiometry data. J Bone Miner Res, 15:2297–2304, 2000.

[7] Bhargavan M, Sunshine JH, and Schepps B. Too few radiologists? AJR Am J Roentgenol., 178:1075–1082, 2002.

[8] Binkley N, Krueger D, Gangnon R, Genant HK, et al. Lateral vertebral assessment: a valuable technique to detect clinically significant vertebral fractures. Osteoporosis Int, 16:1513–1518, 2005.

[9] Black DM, Arden NK, Palermo L, Pearson J, et al. Prevalent vertebral deformities predict hip fractures and new vertebral deformities but not wrist fractures. J Bone Miner Res, 14:821–828, 1999.

[10] Black DM, Cummings SR, Karpf DB, Kauley JA, et al. Randomised trial of effect of alendronate on risk of fracture in women with existing vertebral fractures. Lancet, 348:1535–1541, 1996.

[11] Black DM, Palermo L, Nevitt MC, Genant HK, et al. Comparison of methods for defining prevalent vertebral deformities: The study of osteoporotic fractures. J Bone Miner Res, 10(6):890–902, 1995.

[12] Black DM, Thompson DE, Bauer DC, Ensrud K, et al. Fracture risk reduction with alendronate in women with osteoporosis: the fracture intervention trial. J Clin Endocrinol Metab, 85:4118–4124, 2000.

[13] Black MJ and Jepson AD. Eigentracking: Robust matching and tracking of objects using view-based representation. International Journal of Computer Vision, 26(1):63–84, 1998.

[14] Blake GM, Rea JA, and Fogelman I. Vertebral morphometry studies using dual-energy X-ray absorptiometry. Semin Nucl Med, 27:276–290, 1997.

[15] Bosch HG, Mitchell SC, Boudewijn PF, Leieveldt PF, et al. Active appearance-motion models for endocardial contour detection in time sequences of echocardiograms. In: SPIE Medical Imaging, (pages 257–268). 2001.

[16] Boutroy S, Bouxsein ML, Munoz F, and Delmas PD. In vivo assessment of trabecular bone microarchitecture by high-resolution peripheral quantitative computed tomography. J Clin Endocrinol Metab, 90:6508–6515, 2005.

[17] de Bruijne M, Lund M, Tanko L, Pettersen PC, et al. Quantitative vertebral morphometry using neighbour-conditional shape models. In: 9th MICCAI Conference, vol. 1, (pages 1–8) (Springer-Verlag), 2006.

[18] de Bruijne M and Nielsen M. Image segmentation by shape particle filtering. In: International Conference on Pattern Recognition, (pages 722–725) (IEEE Computer Society Press), 2004.

[19] Chakraborty A and Duncan JS. Integration of boundary finding and region-based segmentation using game theory. In: Bizais Y, ed., Information Processing in Medical Imaging: Proc 14th Int. Conf (IPMI 95). In volume 3 of Computation Imaging and Vision, vol. 3, (pages 189–200) (Kluwer Academic Press, Dordrecht), 1995.

[20] Chapurlat RD, Duboeuf F, Marion-Audibert HO, Kalpakcioglu B, et al. Effectiveness of instant vertebral assessment to detect prevalent vertebral fracture. Osteoporosis Int, 17(8):1189–1195, 2006.

[21] Christensen C, Johansen JS, and Riis B, eds. Epidemiology of vertebral fractures (Copenhagen), 1987.

[22] Cohen LD and Cohen I. Finite element methods for active contour models and balloons for 2D and 3D images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15:1131–1147, 1993.

[23] Cooper C, Shah S, Hand DJ, Adams JE, et al. Screening for vertebral osteoporosis using individual risk factors. Osteoporosis Int., 2:48–53, 1991.

[24] Cootes TF, Edwards GJ, and Taylor CJ. Active appearance models. In: Burkhardt H and Neumann B, eds., 5th European Conference on Computer Vision, vol. 2, (pages 484–498) (Springer, Berlin), 1998.

[25] Cootes TF, Edwards GJ, and Taylor CJ. A comparative evaluation of active appearance model algorithms. In: Carter JN and Nixon MS, eds., 9th British Machine Vision Conference, vol. 2, (pages 557–566) (BMVA Press, Southampton, UK), 1998.

[26] Cootes TF, Edwards GJ, and Taylor CJ. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:681–685, 2001.

[27] Cootes TF, Hill A, Taylor CJ, and Haslam J. The use of active shape models for locating structures in medical images. Image and Vision Computing, 12(6):276–285, 1994.

[28] Cootes TF, Page GJ, Jackson CB, and Taylor CJ. Statistical grey-level models for object location and identification. In: Pycock D, ed., 6th British Machine Vision Conference, (pages 533–542) (BMVA Press, Birmingham, England), 1995.

[29] Cootes TF, Petrovic V, Schestowitz R, and Taylor CJ. Groupwise construction of appearance models using piece-wise affine deformations. In: 16th British Machine Vision Conference, vol. 2, (pages 879–888) (BMVA Press, Birmingham), 2005.

[30] Cootes TF and Taylor CJ. Combining elastic and statistical models of appearance variation. In: European Conference on Computer Vision, vol. 1, (pages 149–163) (Springer), 2000.

[31] Cootes TF and Taylor CJ. Constrained active appearance models. In: 8th International Conference on Computer Vision, vol. 1, (pages 748–754) (IEEE Computer Society Press), 2001.

[32] Cootes TF and Taylor CJ. On representing edge structure for model matching. In: Computer Vision and Pattern Recognition Conference 2001, vol. 1, (pages 1114–1119). 2001.

[33] Cootes TF and Taylor CJ. Statistical models of appearance for medical image analysis and computer vision. Proc SPIE Medical Imaging, 3:138–147, 2001.

[34] Cootes TF and Taylor CJ. Statistical models of appearance for computer vision. Tech. rep., University of Manchester, 2004.

[35] Cootes TF and Taylor CJ. An algorithm for tuning an active appearance model to new data. In: British Machine Vision Conference, (pages 919–928). 2006.

[36] Cummings SR, Black DM, and Nevitt MC. Bone density at various sites for prediction of hip fracture. Lancet, 341:72–75, 1993.

[37] Damilakis J, Maris T, Papadokostakis G, Sideri L, et al. Discriminatory ability of magnetic resonance T2* measurements in a sample of postmenopausal women with low-energy fractures: a comparison with phalangeal speed of sound and dual X-ray absorptiometry. Investigative Radiology, 39(11):706–712, 2004.

[38] Davatzikos C, Tao X, and Shen D. Hierarchical active shape models, using the wavelet transform. IEEE Trans. Med. Imag., 22(3):414–423, 2003.

[39] Davies RH, Cootes TF, Twining CJ, and Taylor CJ. An information theoretic approach to statistical shape modelling. In: 12th British Machine Vision Conference, (pages 3–12) (BMVA Press, Birmingham), 2002.

[40] Davies RH, Twining CJ, Cootes TF, Waterton JC, et al. 3D statistical shape models using direct optimisation of description length. In: Heyden A, ed., 9th European Conference on Computer Vision, (pages 3–20) (Springer Verlag, Berlin Heidelberg), 2002.

[41] Delmas PD, Genant HK, Crans GG, Stock JL, et al. Severity of prevalent vertebral fractures and the risk of subsequent vertebral and non-vertebral fractures: results from the MORE trial. Bone, 33(4):522–532, 2003.

[42] Delmas PD, van de Langerijt L, Watts NB, Eastell R, et al. Underdiagnosis of vertebral fractures is a worldwide problem: the IMPACT study. J Bone Miner Res, 20(4):557–563, 2005.

[43] Dequeker J, Gautama K, and Roh YS. Femoral trabecular patterns in asymptomatic spinal osteoporosis and femoral neck fracture. Clin. Radiol., 25:243–246, 1974.

[44] Dunitz M and Muenier PJ, eds. Ultrasonic evaluation of osteoporosis (Taylor and Francis), 1998.

[45] Eastell R, Cedel SL, Wahner HW, Riggs BL, et al. Classification of vertebral fractures. J Bone Miner Res, 6(3):207–215, 1991.

[46] Edwards GJ, Lanitis A, Taylor CJ, and Cootes TF. Statistical models of face images - improving specificity. Image and Vision Computing, 16:203–211, 1998.

[47] Ettinger B, Block JE, Smith R, Cummings SR, et al. An examination of the association between vertebral deformities, physical disabilities and psychosocial problems. Maturitas, 10:283–296, 1988.

[48] Ettinger B, Genant HK, and Cann CE. Long-term estrogen replacement therapy prevents bone loss and fractures. Ann. Int. Med., 136:298, 1985.

[49] Evans JG. The significance of osteoporosis. In: Smith R, ed., Osteoporosis 1990, chap. 13, (pages 1–8) (Royal College of Physicians, London), 1 edn., 1995.

[50] Evans JG, Prudham D, and Wandles I. A prospective study of fractured proximal femur: Incidence and outcome. Public Health, 93:235–241, 1979.

[51] Faulkner K, Cummings S, Black D, Palermo L, et al. Simple measurement of femoral geometry predicts hip fracture: The Study of Osteoporotic Fractures. J Bone Miner Res, 10:1211–1217, 1993.

[52] Faulkner K, Wacker W, Barden H, Simonelli C, et al. Femur strength index predicts hip fracture independent of bone density and hip axis length. Osteoporosis Int, 17:593–599, 2006.

[53] Felsenberg D and Kalender WA. Computer-assisted morphometry of vertebral fractures. In: Genant H, Jergas M, and van Kuijk C, eds., Vertebral Fracture in Osteoporosis, (pages 309–318) (University of California), 1995.

[54] Ferrar L, Jiang G, Adams J, and Eastell R. Identification of vertebral fractures: an update. Osteoporosis Int., 16:717–728, 2005.

[55] Ferrar L, Jiang G, Armbrecht G, Reid DM, et al. Is short vertebral height always an osteoporotic fracture? the osteoporosis and ultrasound study (OPUS)? Bone, 41:5–12, 2007.

[56] Ferrar L, Jiang G, Barrington NA, and Eastell R. Identification of vertebral deformities in women: comparison of radiological assessment and quantitative morphometry using morphometric radiography and morphometric X-ray absorptiometry. J Bone Miner Res, 15(3):575–585, 2000.

[57] Ferrar L, Jiang G, and Eastell R. Short-term precision for morphometric X-ray absorptiometry. Osteoporosis Int., 12:710–715, 2001.

[58] Ferrar L, Jiang G, Eastell R, and Peel N. Visual identification of vertebral fractures in osteoporosis: using morphometric X-ray absorptiometry. J Bone Miner Res, 18(5):933–938, 2003.

[59] Frost HM. Absorptiometry and osteoporosis: problems. J Bone Miner Metab, 21:255–260, 2003.

[60] Gehlbach S, Bigelow C, Heimisdottir M, May S, et al. Recognition of vertebral fracture in a clinical setting. Osteoporosis Int, 11:577–582, 2000.

[61] Geman S and McClure D. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, LII:4–5, 1997.

[62] Genant HK, Cann CE, Ettinger B, and Gordan GS. Quantitative computed tomography of vertebral spongiosa: A sensitive method of detecting early bone loss after oopherectomy. Ann. Int. Med., 97:699–705, 1982.

[63] Genant HK, Engelke K, Fuerst T, Gluer C, et al. Noninvasive assessment of bone mineral and structure: state of art. J Bone Miner Res, 11:707–730, 1996.

[64] Genant HK, Jergas M, and van Kuijk C, eds. Vertebral Fracture in Osteoporosis (University of California), 1995.

[65] Genant HK, Wu CY, van Kuijk C, and Nevitt MC. Vertebral fracture assessment using a semi-quantitative technique. J Bone Miner Res, 8:1137–1148, 1993.

[66] Goh S, Price RI, Song S, Davis S, et al. Magnetic resonance-based vertebral morphometry of the thoracic spine: age, gender and level-specific influences. Clin Biomech, 15:417–425, 2000.

[67] Gordon CL, Lang TF, Augat P, and Genant HK. Image-based assessment of spinal trabecular bone structure from high-resolution CT images. Osteoporosis Int, 8:317–325, 1998.

[68] Grados F, Roux C, de Vernejoul MC, Utard G, et al. Comparison of four morphometric definitions and a semiquantitative consensus reading for assessing prevalent vertebral fractures. Osteoporosis Int, 12:716–722, 2001.

[69] Gregory JS, Stewart A, Undrill PE, Reid DM, et al. Bone shape, structure, and density as determinants of osteoporotic hip fracture: a pilot study investigating the combination of risk factors. Investigative Radiology, 40(9):591–597, 2005.

[70] Gregory JS, Testi D, Stewart A, Undrill PE, et al. A method for assessment of the shape of the proximal femur and its relationship to osteoporotic hip fracture. Osteoporosis Int, 15(4):5–11, 2004.

[71] Guermazi A, Mohr A, Grigorian M, Taouli B, et al. Identification of vertebral fractures in osteoporosis. Seminars in Musculoskeletal Radiology, 6(3):241–252, 2002.

[72] Guglielmi G, Grimston SK, Fischer KC, and Pacifici R. Osteoporosis: diagnosis with lateral and posteroanterior dual X-ray absorptiometry compared with quantitative CT. Radiology, 192:845–850, 1994.

[73] Harris C and Stephens M. A combined corner and edge detector. In: Alvey Vision Conference, (pages 147–151). 1988.

[74] Harris ST, Watts NB, Genant HK, McKeever CD, et al. Effects of risedronate treatment on vertebral and nonvertebral fractures in women with postmenopausal osteoporosis. a randomized clinical trial. JAMA, 282:1344–1352, 1999.

[75] Holbrook TL, Grazier K, Kelsey JL, and Stauffer RN. The Frequency of Occurrence, Impact and the Cost of Musculo-Skeletal Conditions in the United States (American Academy of Orthopedic Surgeons, Chicago), 1985.

[76] Holland PW and Welsch RE. Robust regression using iteratively reweighted least squares. Communications in Statistics, A6:813–827, 1977.

[77] Howe B, Gururajan A, Sari-Sarraf H, and Long R. Hierarchical segmentation of cervical and lumbar vertebrae using a customized generalized hough transform and extensions to active appearance models. In: Proc IEEE 6th SSIAI, (pages 182–186). 2004.

[78] Hui SL, Slemanda CW, and Johnson CC. Age and bone mass as predictors of fracture in a prospective study. J. Clin. Invest., 81:1804–1809, 1988.

[79] Jiang G, Eastell R, Barrington NA, and Ferrar L. Comparison of methods for the visual identification of prevalent vertebral fracture in osteoporosis. Osteoporosis Int, 15(4):000–nnn, 2004.

[80] Kanis JA and Johnell O. Requirements for DXA for the management of osteoporosis in Europe. Osteoporosis Int, 16:229–238, 2005.

[81] Kanis JA, Melton LJ, Christiansen C, Johnston CC, et al. The diagnosis of osteoporosis. J Bone Miner Res, 9:1137–1141, 1994.

[82] Kanis JA, Oden A, Johnell O, Johansson H, et al. The use of clinical risk factors enhances the performance of BMD in the prediction of hip and osteoporotic fractures in men and women. Osteoporosis Int, 18:1033–1046, 2007.

[83] Kass M, Witkin A, and Terzopoulos D. Snakes: Active contour models. In: 1st International Conference on Computer Vision, (pages 259–268) (London), 1987.

[84] Kazakia GJ and Majumdar S. New imaging techniques in the diagnosis of osteoporosis. Rev Endocr Metab Disord, 7:67–74, 2006.

[85] Kelsey JL and Hoffman S. Risk factors for hip fracture. N. Engl. J. Med., 316(7):404–406, 1987.

[86] Krug R, Banerjee S, Han ET, Newitt DC, et al. Feasibility of in vivo structural analysis of high-resolution magnetic resonance images of the proximal femur. Osteoporosis Int, 16:1307–1314, 2005.

[87] Li J, Wu CY, Jergas M, and Genant HK. Comparison of semiquantitative and quantitative methods for assessment of vertebral fractures. In: Christiansen C, ed., Fourth International Symposium on Osteoporosis and Consensus Development Conference (Gardiner-Caldwell), 1993.

[88] Li J, Wu CY, Jergas M, and Genant HK. Diagnosing prevalent vertebral fractures: A comparison between quantitative morphometry and a standard visual (semiquantitative) approach. In: Genant H, Jergas M, and van Kuijk C, eds., Vertebral Fracture in Osteoporosis, (pages 271–279) (University of California), 1995.

[89] Lindsay R, Gallagher JC, Kleerekoper M, and Pickar JH. Effect of lower doses of conjugated equine estrogens with and without medroxyprogesterone acetate on bone in early postmenopausal women. JAMA, 287:2668–2676, 2002.

[90] Link TM, Bauer J, Kollstedt A, Stumpf I, et al. Trabecular bone structure of the distal radius, the calcaneus, and the spine: which site predicts fracture status of the spine best? Investigative Radiology, 39(8):487–497, 2004.

[91] Link TM, Majumdar S, Augat P, Lin JC, et al. In vivo high resolution MRI of the calcaneus: differences in trabecular structure in osteoporosis patients. J Bone Miner Res, 13:1175–1182, 1998.

[92] Majumdar S, Genant HK, Grampp S, Newitt DC, et al. Correlation of trabecular bone structure with age, bone mineral density, and osteoporotic status: in vivo studies in the distal radius using high resolution magnetic resonance imaging. J Bone Miner Res, 12:111–118, 1997.

[93] Manly BFJ. Multivariate Statistical Methods, a Primer (Chapman and Hall), 1986.

[94] McCloskey E, Selby P, de Takats D, Bernard J, et al. Effects of clodronate on vertebral fracture risk in osteoporosis: a 1-year interim analysis. Bone, 28(3):310–315, 2001.

[95] McCloskey EV, Spector TD, Eyres KS, Fern ED, et al. The assessment of vertebral deformity: a method for use in population studies and clinical trials. Osteoporosis Int, 3:138–147, 1993.

[96] McClung MR, Geusens P, Miller PD, Zippel H, et al. Effect of risedronate on the risk of hip fracture in elderly women. N Engl J Med, 344:333–340, 2001.

[97] McInerney T and Terzopoulos D. Deformable models in medical image analysis: a survey. Medical Image Analysis, 1(2):91–108, 1996.

[98] McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12:153–157, 1947.

[99] Melton LJ. How many women have osteoporosis now? J Bone Miner Res, 10:175–177, 1995.

[100] Melton LJ, Atkinson EJ, Cooper C, O’Fallon WM, et al. Vertebral fractures predict subsequent fractures. Osteoporosis Int, 10:214–221, 1999.

[101] Melton LJ, Chrischilles EA, Cooper C, Lane AW, et al. How many women have osteoporosis? J Bone Miner Res, 7:1005–1010, 1992.

[102] Miller CW. Survival and ambulation following hip fractures. J. Bone Joint Surg., 60A:930–934, 1978.

[103] Minne HW, Leidig G, Wuster C, Siromachkostov L, et al. A newly developed spine deformity index (SDI) to quantitate vertebral crush fractures in patients with osteoporosis. Bone Miner, 3:335–349, 1998.

[104] Nastar C and Ayache N. Fast segmentation, tracking and analysis of deformable objects. In: 4th International Conference on Computer Vision, (pages 275–279) (IEEE Computer Society Press, Berlin), 1993.

[105] Neer RM, Arnaud CD, Zanchetta JR, Prince R, et al. Effect of parathyroid hormone (1-34) on fractures and bone mineral density in postmenopausal women with osteoporosis. N Engl J Med, 344:1434–1441, 2001.

[106] Peacock M, Turner CH, Liu G, Manatunga AK, et al. Better discrimination of hip fracture using bone density, geometry and architecture. Osteoporosis Int, 5:167–173, 1995.

[107] Poon CS, Braun M, Fahrig R, Ginige A, et al. Segmentation of medical images using an active contour model incorporating region-based image features. In: Robb R, ed., Proc 3rd Conf. on Visualization in Biomedical Computing (VBC 94), SPIE Proc, vol. 2359, (pages 90–97) (SPIE, Bellingham, WA), 1994.

[108] Press W, Teukolsky S, Vetterling W, and Flannery B. Numerical Recipes in C (Cambridge University Press), 2 edn., 1992.

[109] Prudham D and Evans JG. Factors associated with falls in the elderly: a community study. Age Ageing, 10(3):141–146, 1981.

[110] Rea JA, Li J, Blake GM, Steiger P, et al. Visual assessment of vertebral deformity by X-ray absorptiometry: a highly predictive method to exclude vertebral deformity. Osteoporosis Int, 11:660–668, 2000.

[111] Rea JA, Steiger P, Blake GM, Potts E, et al. Morphometric x-ray absorptiometry: reference data for vertebral dimensions. J. of Bone and Mineral Research, 13:464–474, 1998.

[112] Reginster J, Minne H, Sorensen O, Hooper M, et al. Randomized trial of the effects of risedronate on vertebral fractures in women with established postmenopausal osteoporosis. Osteoporosis Int, 11:83–91, 2000.

[113] Roberts MG, Cootes TF, and Adams JE. Linking sequences of active appearance sub-models via constraints: an application in automated vertebral morphometry. In: 14th British Machine Vision Conference, (pages 349–358). 2003.

[114] Roberts MG, Cootes TF, and Adams JE. Vertebral shape: Automatic measurement with dynamically sequenced active appearance models. In: 8th MICCAI Conference, vol. 2, (pages 733–740). 2005.

[115] Roberts MG, Cootes TF, and Adams JE. Automatic segmentation of lumbar vertebrae on digitised radiographs using linked active appearance models. In: Graham J, Thacker N, and Cootes T, eds., Medical Image Understanding and Analysis Conference, (pages 120–124) (BMVA), 2006.

[116] Roberts MG, Cootes TF, and Adams JE. Improving the segmentation accuracy of fractured vertebrae with dynamically sequenced active appearance models. In: 9th MICCAI Conference - Workshop on joint and bone disease, (pages 1–8). 2006.

[117] Roberts MG, Cootes TF, and Adams JE. Vertebral morphometry: semi-automatic determination of detailed shape from DXA images using active appearance models. Investigative Radiology, 41(12):849–859, 2006.

[118] Roberts MG, Cootes TF, Pacheco EM, and Adams JE. Quantitative vertebral fracture detection on DXA images using shape and appearance models. Academic Radiology, 14:1166–1178, 2007.

[119] Ross PD, Davis JW, Vogel JM, and Wasnich RD. A critical review of bone mass and the risk of fractures in osteoporosis. Calcif. Tissue Int., 46(3):149–161, 1990.

[120] Rousseeuw PJ and Croux C. Alternatives to the median absolute deviation. J Amer Stat Assn, 88:1273–1283, 1993.

[121] Sclaroff S and Isidoro J. Active blobs. In: International Conference on Computer Vision (ICCV 98), (pages 1146–1153) (Springer), 1998.

[122] Scott IM. Searching Image Databases using Appearance Models, PhD Thesis, chap. Further Experiments with Texture AAMs, (pages 135–138) (Division of Imaging Science and Biomedical Engineering, University of Manchester), 2004.

[123] Scott IM, Cootes TF, and Taylor CJ. Improving active appearance model matching using local image structure. In: 18th Conference on Information Processing in Medical Imaging, (pages 258–269). 2003.

[124] Singh M, Nagrath AR, and Maini PS. Changes in trabecular pattern of the upper end of the femur as an index of osteoporosis. J Bone Joint Surg, 52:437, 1970.

[125] Smyth PP. Measurement of osteoporosis using computer vision, PhD Thesis (Department of Medical Biophysics, University of Manchester), 1997.

[126] Smyth PP, Taylor CJ, and Adams JE. Vertebral shape: automatic measurement with active shape models. Radiology, 211:571–578, 1999.

[127] Sorenson JA and Cameron JR. A reliable in vivo measurement of bone mineral content. J. Bone Joint Surg., 49:481–497, 1967.

[128] Staib LH and Duncan JS. Left ventricular analysis from cardiac images using deformable models. Proc Computers in Cardiology, (pages 427–430), 1989.

[129] Ste-Marie LG, Sod E, Johnson T, and Chines A. Five years of treatment with risedronate and its effects on bone safety in women with postmenopausal osteoporosis. Calcif Tissue Int, 75:469–476, 2004.

[130] Steiger P, Cummings SR, Genant HK, Weiss H, et al. Morphometric X-ray absorptiometry of the spine: Correlation in vivo with morphometric radiography. Osteoporosis Int, 4:238–244, 1994.

[131] Szulc P and Delmas PD. Vertebral Fracture Initiative Resource Document (International Osteoporosis Foundation), 2005.

[132] Szulc P and Delmas PD. Vertebral Fracture Initiative Resource Pack (International Osteoporosis Foundation), 2005.

[133] Tomomitsu T, Murase K, Sone T, and Fukunaga M. Comparison of vertebral morphometry in the lumbar vertebrae by T1-weighted sagittal MRI and radiograph. European Journal of Radiology, 56:102–106, 2005.

[134] Torgerson DJ and Bell-Syer SEM. Hormone replacement therapy and prevention of non-vertebral fractures: a meta-analysis of randomised trials. JAMA, 285:2891–2897, 2001.

[135] Turk M and Pentland A. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71–86, 1991.

[136] Vapnik V. The nature of statistical learning theory (Springer-Verlag), 1995.

[137] Wasnich RD, Ross PD, Heilbrun LK, and Vogel JM. Prediction of postmenopausal fracture risk with use of bone mineral measurements. Am. J. Obstet. Gynecol., 153(7):745–751, 1985.

[138] Wilkins C and Birge S. Prevention of osteoporotic fractures in the elderly. Am. J. Med., 118:1190–1195, 2005.

[139] Williams AL, Al-Busaidi A, Sparrow PJ, Adams JE, et al. Under-reporting of osteoporotic vertebral fractures on computed tomography. European Journal of Radiology, in press, 2008.

[140] Wilson CR and Matson M. Dichromatic absorptiometry of vertebral bone mineral content. Invest. Radiol., 12:188–194, 1977.

[141] Wu CY, Li J, Jergas M, and Genant HK. Semiquantitative and quantitative assessment of incident fractures: comparison of methods - abstract. J Bone Miner Res, 9(Suppl 1):S157, 1994.

[142] Wu CY, Li J, Jergas M, and Genant HK. Comparison of semiquantitative and quantitative methods for the assessment of prevalent and incident vertebral fractures. Osteoporosis Int, 5:354–379, 1995.

[143] Zamora G, Sari-Sarraf H, and Long R. Hierarchical segmentation of vertebrae from X-ray images. Med Imaging: Image Process, Proc of SPIE, 5032:631–642, 2003.

[144] Zweig MH and Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39:561–577, 1993.
