
Structure Features for Content-Based

Image Retrieval and Classification

Problems

Dissertation

zur Erlangung des Doktorgrades

der Fakultät für Angewandte Wissenschaften

an der Albert-Ludwigs-Universität Freiburg im

Breisgau, Deutschland

von

Dipl. Phys. Gerd Brunner

2006


Dekan: Prof. Dr. Bernhard Nebel

Prüfungskommission: Prof. Dr. Rolf Backofen (Vorsitz), Prof. Dr. Stephane Marchand-Maillet (Prüfer), Prof. Dr. Hans Burkhardt (Betreuer), Prof. Dr. Thomas Ottmann (Beisitz)

Datum der Disputation: 30. Januar 2007


Acknowledgements

This dissertation is a result of my work at the Chair of Pattern Recognition and Image Processing, Institute for Computer Science of the Albert-Ludwigs-University Freiburg, Germany. I am most grateful to my advisor Prof. Dr.-Ing. Hans Burkhardt, who has always supported my research and gave me the opportunity to work at his chair. I would also like to thank Prof. Stephane Marchand-Maillet from the University of Geneva for being the co-referee of my thesis. In addition, I would like to thank all my colleagues from the chair for fruitful discussions and support. I am most grateful to Cynthia Findlay for her constant encouragement and advice. I would like to express my thanks to Stefan Teister and his administration team for keeping my computer running. I am greatly indebted to my wife Kasia for her love and understanding as well as for her contributions to this work.

Freiburg, September 2006

Gerd Brunner


Zusammenfassung

Während der vergangenen Jahrzehnte konnte man einen stetigen Zuwachs an Bildmaterial verfolgen, der gigantische Archive hervorbrachte. Inhaltsbasierte Bildsuchealgorithmen haben versucht, den Zugriff auf Bilddaten zu vereinfachen. Heutzutage gibt es zahlreiche Merkmalsextraktionsverfahren, die entscheidend zur Qualitätsverbesserung von inhaltsbasierten Bildsuche- und Bildklassifizierungssystemen beigetragen haben.

Struktur ist eines der wichtigsten Merkmale der Bildanalyse. Das beruht auf der menschlichen Wahrnehmung von Objekten und Szenen, die zum Großteil auf speziellen räumlichen Konfigurationen und Änderungen der Intensität basiert. In dieser Dissertation führen wir eine strukturbasierte Merkmalsextraktionsmethode ein. Die Methode kann sowohl globale Strukturen als auch lokale perzeptuelle Gruppen und deren Relationen repräsentieren. Der Vorteil der Methode ist ihre breite Anwendbarkeit und Invarianz gegen Ähnlichkeitstransformationen und gegen Änderungen der Beleuchtung. Erstens wird die Erzeugung von Kanten mittels mehrerer Kantendetektoren behandelt. Aus diesem Grund wird ein Algorithmus vorgestellt, der automatisch die besten Parameter für den Canny-Kantendetektor berechnen kann. Zweitens wenden wir eine Liniengruppierungsmethode an, die auf agglomerativem hierarchischem Clustering beruht. Dazu werden die Liniensegmente mittels einer Kantentracking-Methode extrahiert. Das Clustering-Verfahren evaluiert die beste Linkage-Methode und schneidet das Dendrogramm automatisch ab. Nachdem die finale Cluster-Hierarchie erzeugt ist, werden die weniger signifikanten Cluster, basierend auf einem Kompaktheitsmaß, verworfen. Drittens berechnen wir strukturbasierte Merkmale auf globalen und lokalen Skalen. Die globale Skala gewährleistet eine holistische Szenenanalyse des Bildes. Im Gegensatz dazu kodieren die lokalen Merkmale perzeptuelle Gruppen und deren Relationen. Abschließend werden die strukturbasierten Merkmale auf Binär- und Farbbild-, Objektklassen- und Textur-Retrieval und/oder -Klassifikation angewandt. Die erste Anwendung ist die Klassifikation und inhaltsbasierte Suche von historischen Wasserzeichen. Die zweite Anwendung ist die Bildsuche in zwei Farbbilddatenbanken mit 1.000 und 10.000 Bildern der Corel-Kollektion. Die Ergebnisse werden zusammen mit einer Invarianzanalyse präsentiert, wobei die vorgestellten Merkmale eine Invarianz von über 96% erzielen konnten. Die dritte Anwendung beruht auf der Erkennung und dem Retrieval von Objektklassen der Caltech-Datenbank, wobei eine Erkennungsrate von 92.45% bzw. 95.45% für die Fünf- und Dreiklassenprobleme erzielt werden konnte.


Die vierte und finale Anwendung ist die Klassifikation von Texturen der Brodatz-Kollektion. Eine Support-Vektormaschine mit einem Intersection-Kernel konnte unter Verwendung des Leave-one-out-Tests eine Klassifikationsrate von 98% erzielen.


Abstract

During the past decades we have been observing a steady increase in image data, leading to huge repositories. Content-based image retrieval methods have sought to ease access to image data. To date, numerous feature extraction methods have been proposed

in order to improve the quality of content-based image retrieval and image classification

systems.

Structure is one of the most important features for image analysis as shown by the fact

that the human perception of objects and scenes is to a large extent based on particular

spatial configurations and changes in intensity.

In this thesis we introduce a structure-based feature extraction technique. The method

is capable of representing the global structure of an image, as well as local perceptual groups

and their connectivity. The advantage of the method is its broad range of applications and

its invariance against changes in illumination and similarity transformations.

We first discuss the creation of edge maps, accompanied by an evaluation of various edge detectors. To this end, we present a method that automatically computes the best set of parameters for the Canny edge detector.

Secondly, we apply a line segment grouping method based on agglomerative hierarchical

clustering, where the segments are extracted with an edge point tracking algorithm. The

procedure automatically evaluates the best linkage method and prunes the dendrogram

based on a subgraph distance ratio. Once the final clusters are obtained, an intra-class

compactness measure is used to discard less significant segment groups.

Thirdly, the structure-based features are computed on a global and local scale. The

global scale ensures a holistic scene analysis of an image, whereas the local features account

for perceptual groups and their connectivity.

Finally, we apply the structure-based features to tasks as broad as binary, color, ob-

ject class and texture image retrieval and/or classification. The first application is the

classification and content-based image retrieval of ancient watermark images. The second

application is a retrieval task on two color image databases from the Corel collection with 1,000 and 10,000 images. The results are accompanied by an invariance analysis, where

our features have obtained a score of more than 96%. The third application is object class

recognition and retrieval for the Caltech database, where we achieve a classification rate


of 92.45% and 95.45% for the five- and three-class problems, respectively. The fourth and final application is the classification of textures obtained from the well-known Brodatz

collection. A support vector machine with an intersection kernel and a leave-one-out test

obtained a classification rate of 98%.


Contents

1 Introduction 1

1.1 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Edge Detection 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Canny Edge Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 Performance Evaluation and Parameter Selection . . . . . . . . . . . . . . 14

2.3.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Line-Segment Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5 Hough Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6 Edge Point Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Clustering 27

3.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2 Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.1 The Cophenetic Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Cluster Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3.1 Cluster Tendency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.3.2 Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.3 Cutting a Dendrogram . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.4 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4 Structure-Based Features 43

4.1 Euclidean Distance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Feature Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.1 Global Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2.2 Local Perceptual Features . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


4.3 Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4 Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.5 Feature Space Representation . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5 Applications 67

5.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.1.1 Ancient Watermark Images . . . . . . . . . . . . . . . . . . . . . . 67

5.1.2 Corel Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.3 Caltech Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.4 Brodatz Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.1 Measures for Classification . . . . . . . . . . . . . . . . . . . . . . . 76

5.2.2 Measures for Content-based Image Retrieval . . . . . . . . . . . . . 77

5.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.4 Filigrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4.2 Retrieval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.4.3 Partial Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.4.4 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.5 Color Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.5.1 Retrieval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.5.2 Invariance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.6 Object Class Recognition and Retrieval . . . . . . . . . . . . . . . . . . . . 111

5.6.1 Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.6.2 Retrieval Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5.7 Texture Class Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6 Conclusions and Perspectives 127

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.2 Outlook and Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

A Appendix 131

A.1 Mercer Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131


Bibliography 132


List of Figures

1.1 Content-based image retrieval system chart. . . . . . . . . . . . . . . . . . 5

2.1 Line profiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Comparison of five edge detectors, with default parameters. Visually, the

Canny method performs best. . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Typical examples of edge profiles and their derivatives. . . . . . . . . . . . 13

2.4 Edge maps for the Canny edge detector with various sets of parameters. The leftmost column shows gray-scale images. The second column shows the ground truth images. The third column displays the best edge maps we could obtain automatically. The parameters are listed in the first column of Table 2.3. The last column depicts sample edge maps obtained with poorly performing Canny parameters, typically in the range of σ ≥ 3.5 and [θl, θu] approaching 1. . . . . 17

2.5 Left panel shows a straight line in the Cartesian space. In addition, r and

θ are plotted to show the parameterization of the line. Right panel the

Hough transformation of the single line from the left. Note, that the pattern

originates from the quantization in the Hough space. . . . . . . . . . . . . 22

2.6 Comparison of line segments obtained from the Hough-transform and the

edge point tracking algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.1 Flow chart of the agglomerative hierarchical clustering algorithm. . . . . . 29

3.2 Various dendrograms obtained from six different linkage methods. . . . . . 32

3.3 The two graphs show dendrograms where the ellipse in the left one encloses

a typical subgraph. SD12 and SD23 are the distances between nodes in the

enclosed subgraph. The right dendrogram pictures the actual pruning for

the segments of the temple image (see Figure 3.4). . . . . . . . . . . . . . . 39


3.4 The first row shows a color image and all line segments. The second row

presents line segment clusters, where the left graph illustrates noise-like

segments (non-compact clusters). Whereas the right graph shows a salient

subset of line segments according to the compactness measure (see Equa-

tion 3.20). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1 A graphical illustration of a set of points and three corresponding transfor-

mations (translation, rotation and scaling). Note, that we only show the

EDM elements (e12 to e15) for the point P1. . . . . . . . . . . . . . . . . . 45

4.2 Illustration of line segment properties. . . . . . . . . . . . . . . . . . . . . 49

4.3 Sample groups of line segments that follow certain constraints (see Equa-

tions 4.20 to 4.24 and Equations 4.31 to 4.35). . . . . . . . . . . . . . . . . 54

4.4 Local perceptual groups of line segments based on Equations 4.31 to 4.35

and Equations 4.36 to 4.39. . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.5 Class-wise averaged recall versus number of retrieved images plot with eight

different similarity measures for two classes of the ancient Watermark database

(see Section 5.1.1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.1 Samples of scanned ancient watermark images (courtesy Swiss Paper Mu-

seum, Basel). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.2 Sample filigrees of each class from the Watermark database after enhance-

ment and binarization (see [112]). The classes are according to Table 5.1

(starting from top left): Eagle, Anchor1, Anchor2, Coat of Arms, Circle, Bell,

Heart, Column, Hand, Sun, Bull Head, Flower, Cup and Other objects. . . 70

5.3 Sample filigrees from the Watermark database after enhancement and bina-

rization (see [112]). Each of the four rows show watermarks from the same

class, namely Heart, Hand, Eagle and Column. The samples show the large

intra-class variability within the Watermark database. . . . . . . . . . . . . 71

5.4 Sample images from the two Corel databases, where the images in the first

row are randomly taken from the 1,000 image set and the images in the second row from the 10,000 image database. . . . . 73

5.5 Two sample images of each class from the Caltech database. . . . . . . . . 74

5.6 Sample images from the 13 classes of the Brodatz image database. . . . . . 75

5.7 Sample retrieval result obtained with our structure-based features (see Sec-

tion 4.2) from the class Anchor1 of the Watermark image database. . . . . 84

5.8 Sample retrieval result of the class Circle from the Watermark database,

under the usage of global and local structural features (see Section 4.2). . . 86

5.9 Sample retrieval result obtained with our structure-based features (see Sec-

tion 4.2) from the class Column of the Watermark database. . . . . . . . . 87


5.10 Sample retrieval result of the class Flower from the Watermark database,

under the usage of our structural features (see Section 4.2). . . . . . . . . 88

5.11 Sample retrieval result obtained with our structure-based features (see Sec-

tion 4.2) from the class Eagle of the Watermark database. . . . . . . . . . 89

5.12 Sample retrieval result obtained with our structure-based features (see Sec-

tion 4.2) from the class Eagle of the Watermark database. . . . . . . . . . 90

5.13 Sample retrieval result of the class Coat of Arms from the Watermark

database. As features we have incorporated our global and local structure-

based features (see Section 4.2). . . . . . . . . . . . . . . . . . . . . . . . . 91

5.14 Class-wise recall vs. number of images retrieved graphs for the Watermarks. . . . 93

5.15 Partial matching result obtained from the Watermark database with our

structural features (see Section 4.2) using the intersection similarity measure. The query image shows a cutout of a filigree from the

class Coat of Arms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.16 Partial matching result obtained from the Watermark database with our

structural features (see Section 4.2) using the intersection similarity measure. Note that the query shows the head of an eagle that

belongs to the class Eagle. . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.17 The seven images show examples of watermarks with ambiguities with respect to their ground truth class membership and their real content. The labels below each image show the watermark class. Image 5.17(a) belongs to the class Heart, although there is just a tiny heart in the center of the watermark. In fact, it looks more like a Coat of Arms. A similar argument holds for image 5.17(c). Note the embedded eagle. Specifically, 5.17(c) and 5.17(d)

were classified as Eagle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.18 Sample image retrieval results obtained from the 1,000 image collection. As

features we have used block-based features (see Section 5.5.1), and the global

structure based features (see Section 4.2.1). The first image represents the

query and the images are arranged in decreasing similarities from left to

right, indicated by the numbers above the images. Note that 1 denotes an

identical match with the query image. . . . . . . . . . . . . . . . . . . . . . 103

5.19 Sample retrieval results obtained from the 10,000 image collection. As fea-

tures we have used block-based features (see Section 5.5.1), and the global

structure based features (see Section 4.2.1). The first image represents the

query and the images are arranged in decreasing similarities from left to right

indicated by the numbers above the images, where 1 denotes an identical

match with the query image. . . . . . . . . . . . . . . . . . . . . . . . . . . 104


5.20 Average class-wise recall versus the number of retrieved images graph for the

1,000 image database. All features are plotted for class-wise comparisons.

Note that the curves represent averaged quantities, i.e. each class member

was taken as a query image and the resulting graphs were averaged. . . . . 106

5.21 Precision-recall graph for the 1,000 image dataset, where the graph is averaged over all images and classes, representing an overall performance measure. . . 107

5.22 Precision-recall graph for the 10,000 image database, where the graph is

averaged over all images and classes representing an overall performance

measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.23 Robustness analysis of the 1,000 image database for different brightness and

rotation conditions (brightness 30 % increased, saturation 10 % decreased).

The ordinate shows the degree of averaged invariance for each feature set,

with 1 being 100 % invariant. . . . . . . . . . . . . . . . . . . . . . . . . . 109

5.24 Result obtained with structure features (see Section 4.2) from the Caltech database. . . . 115

5.25 Result obtained with structure features (see Section 4.2) from the Caltech database. . . . 116

5.26 Class-wise averaged precision-recall graphs for the Caltech image database. 117

5.27 Result obtained with structural features (see Section 4.2) for some transformations. . . . 119

5.28 Result obtained with structural features (see Section 4.2) for some transformations. . . . 120

5.29 Precision-recall graph for several image transformations for the query image

of Figures 5.27 and 5.28 that belong to the class Motorbikes. . . . . . . . . 121


List of Tables

2.1 Number of ground truth pixels for sample images. . . . . . . . . . . . . . . 16

2.2 Falsely detected edge pixels for the ten best sets of parameters [in %]. The

two right most columns indicate the mean error for the best ten and best five

sets, respectively. The second part of the table lists the absolute numbers

of falsely detected edge pixels. . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3 The parameters [θl, θu, σ], for the ten best edge maps. At the end of the

table we list the averaged values. . . . . . . . . . . . . . . . . . . . . . . . 19

3.1 Averaged cophenetic correlation coefficients for six linkage methods. . . . . 37

5.1 The classes of the Watermark database. . . . . . . . . . . . . . . . . . . . . 72

5.2 Sample confusion matrix for a two class problem. . . . . . . . . . . . . . . 76

5.3 Class-wise performance measures for the Watermark database. A detailed

description can be found in Section 5.2.2 and in the text. . . . . . . . . . . 94

5.4 Confusion matrix for the Watermark database. . . . . . . . . . . . . . . . . 98

5.5 Class-wise true positive (TP) and false positive (FP) rates for the Watermark

database, where the first column indicates the correctly classified images and

the total number of class members. The second column shows the TP rate

in [%]. Column three represents all FP obtained and column four gives the

FP rate in [%]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.6 Detailed performance measures for the Watermark database. The measures

are explained in 5.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.7 Averaged feature invariance representation. First column: Invariance under

different brightness conditions (brightness 30 % increased, saturation 10 %

decreased). Second column: Invariance under different brightness and rota-

tion. Third column: Invariance between: different brightness conditions and

different brightness plus rotation. LS: Line segment feat., BBF: Block-based

feat., IFH: invariant feature histogram. . . . . . . . . . . . . . . . . . . . . 110

5.8 Confusion matrix for the Caltech database. . . . . . . . . . . . . . . . . . . 112


5.9 Class-wise true and false positive rates for the Caltech database, where the

first column indicates the correctly classified images and the total number

of class members. The second column shows the TP rate in [%]. Column

three represents all FP obtained and column four gives the FP rate in [%]. 112

5.10 Detailed performance measures for the Caltech database (see Section 5.2). 113

5.11 Comparison of our structure-based method with others from the literature. 113

5.12 Confusion matrix for the Brodatz database. . . . . . . . . . . . . . . . . . 122

5.13 True positive and false positive rates for the Brodatz database, where the

first column indicates the correctly classified images and the total number

of class members. The second column shows the TP rate in [%]. Column

three represents all FP obtained and column four gives the FP rate in [%]. 123

5.14 Detailed performance measures (see Section 5.2.1), for the Brodatz database. 124

5.15 Comparison of the structure-based method with others from the literature

for the Brodatz database. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125


Chapter 1

Introduction

Digital images play an important role in our everyday life. In many areas of science,

commerce and government, images are acquired and used daily. During the past decades we have been observing a steady increase of image data, leading to huge repositories. The national geographic imagery archive of the United States currently has a size in the range of petabytes (PB) and grows by several terabytes (TB) every day. The commercial Internet search engine Yahoo claims to have indexed about 1.6 billion images. This rapid evolution triggers the demand for qualitative and quantitative image retrieval systems. To

date, various commercial systems and Internet search engines feature keyword-based image

retrieval. Usually, the results are not satisfactory due to the high complexity of images, which cannot easily be described by words. Therefore, content-based image retrieval has gained in importance during the past decade. The areas of application include, but are not limited to:

• E-commerce solutions (support client online product search, e.g. retrieve all similar

pictures of a specific couch. Possible interest groups are online-stores)

• Authors, editors, lecturers or journalists in order to improve their working efficiency

(retrieval by subject or categorical search, e.g. an author writes an article about

Rome; he wants to find all images of St.Peter’s Cathedral)

• Private users (retrieve personal images acquired with their own digital camera)

• Surveillance systems (search for a certain person, a car or other items; possible areas

of application are casinos, traffic surveillance, access controls, crime prevention and

others)

• Aviation, space and military applications, e.g. navigation systems, satellite and

imagery (object and structure recognition, e.g. buildings, road network)


• Medicine (analysis of X-rays, Magnetic Resonance Imaging (MRI) - might help for

diagnostics, tissue classification)

• Biological image processing (e.g. pollen recognition, cell detection and recognition)

• Art collections, museums (e.g. retrieval of painting styles)

• Watermarks and filigrees (retrieval of ancient watermarks)

• Autonomous robot navigation (e.g. obstacle detection)

The field of image retrieval has also triggered advancements in the related area of image classification. One of the main differences between image classification and retrieval is that the latter imposes a kind of similarity ranking, whereas the former distinguishes between classes rather than between single images. The broad range of image retrieval and classification applications demands general as well as highly specialized systems equipped with image features such as color, texture or structure-based ones. In this thesis we focus on image classification and content-based image retrieval with structure-based features.

1.1 Contributions of this Thesis

The aim of this thesis is the investigation of discriminative structure-based features for

image classification and content-based image retrieval problems. Moreover, the method

developed in this work contributes to the state of the art in (structure-based) image re-

trieval. In the following we will list the main contributions of this thesis:

• Edge detection and verification: We show that the Canny edge detector is most

suitable for deriving structure-based image features. We present a method that auto-

matically computes a set of Canny parameters for real-world images. To this end, we evaluate 550 different parameter sets and determine the best one by comparison with a manually generated ground truth. The best parameter set produces an error rate of only 0.1 to 1.3%. Finally, we define a range for the three parameters, based

upon the ten best sets.

• Hierarchical Clustering: We present a straight line segment clustering method.

Firstly, we determine the existence of an underlying clustering structure in the feature

space with the Hopkins test. A result of h = 0.81 suggests the adequacy of our clustering method with a confidence level of more than 90%. Secondly, we evaluate the most suitable linkage method for the agglomerative hierarchical clustering by computing the cophenetic correlation coefficient for 15,972 different hierarchies (see the sketch after this list). With


an average score of 0.9262 the average linkage method produces the best result.

Thirdly, we introduce a subgraph distance ratio that is used in order to prune or cut

a dendrogram. The resulting groups of straight line segments are divided into salient

and less important clusters on the basis of the intra-class compactness.

• Structure-Based Features: We have developed a method that represents the

global structure of an image, as well as the local structure of perceptual groups

and their connectivity. The relations between perceptual groups are computed using Euclidean distance matrices. The features are invariant against similar-

ity transformations and robust against changes in illumination. The method can be

applied to a broad range of applications.

• Classification and Content-Based Image Retrieval of Ancient Watermark

Images: The aim of this task is the retrieval and classification of ancient watermarks

(14 classes). The retrieval results are presented with averaged class-wise precision-

recall graphs that show the discrimination ability of the structure-based feature set.

For the classification we use a support vector machine with an intersection kernel.

The results are obtained by a leave-one-out evaluation and show that for the 14-class problem an average true positive rate of 87.41% was achieved.

• Object Class Recognition and Retrieval: The next contribution is object class

recognition of the five classes from the Caltech database consisting of more than

2800 images. We apply a set of structure-based features and compare the results

with scores from the literature. The results show a correct classification rate of more than 95%.

• Color Image Retrieval: Structure-based features are applied to two Corel image

datasets of 1,000 and 10,000 images. The results are compared with two state-of-the-art methods and presented with averaged precision-recall graphs. The analysis is

completed with an investigation of the feature robustness under several transforma-

tions for all presented methods. The result shows that the proposed features are

invariant (96.56%) against rotation and brightness transformations.

• Texture Class Recognition: We perform a classification task on texture samples

from the Brodatz benchmark database. The structure-based features obtain an aver-

age classification rate of 98%, which is in the same range as the best published scores.
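To make the linkage evaluation step mentioned above concrete, the following minimal sketch selects a linkage method by the cophenetic correlation coefficient using scipy. The randomly generated toy feature matrix and the candidate linkage list are illustrative assumptions, not the thesis setup.

```python
# Hedged sketch: choose a linkage method for agglomerative hierarchical clustering
# by the cophenetic correlation coefficient. Toy data and the candidate list are
# assumptions for illustration only.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
features = rng.random((60, 4))             # toy line-segment feature vectors
dists = pdist(features)                    # pairwise Euclidean distances

best_method, best_ccc = None, -1.0
for method in ("single", "complete", "average", "weighted", "centroid", "ward"):
    Z = linkage(features, method=method)   # build the dendrogram
    ccc, _ = cophenet(Z, dists)            # cophenetic correlation coefficient
    if ccc > best_ccc:
        best_method, best_ccc = method, ccc

print(best_method, round(best_ccc, 4))     # the thesis found 'average' to score best
```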

1.2 Content-Based Image Retrieval

There have been great undertakings to create text-based image search engines by several

companies, e.g. Google, Yahoo or Microsoft. Although pure text-based search already performs quite well, the image search results are not satisfactory at all, since text cannot

sufficiently describe an image’s content, e.g. imagine an expressionistic painting. Hence,

image retrieval systems solely based on textual search often perform poorly. Content-based

image retrieval (CBIR) on the other hand enables us to search an image by its real content.

Figure 1.1 depicts a chart of a CBIR system.

A wide variety of models have been considered for image retrieval [19] from the very

simple early vision techniques, like color histograms [133], to very sophisticated methods,

applicable for specific and restricted environments. The various kinds of CBIR features

can be grouped into the following categories:1 color with/without spatial constraints [141]

[130], texture [43] [26] [53] [84]2 [128] [127], shape and structure [61] [91] [4] [13] [14].

In the past, the scientific community has propagated three types of content-based image retrieval approaches: semi-automatic extraction of attributes, automatic feature extraction, and object recognition.

Semi-automatic systems [27] provide tools to accelerate the annotation process; however, they require manual interaction for database generation. The user might manually (possibly computer-assisted) segment objects in the image, followed by an automatic analysis and annotation by the computer. Such systems considerably accelerate the image annotation. Semi-automatic systems can only work offline, i.e., a database has to be generated before queries can be processed. Hence, they are not well suited for heavily fluctuating data such as a collection of Internet sites. In addition, these systems are limited in size because new image data might be acquired faster than manual annotations can be added.

Fully automatic systems [131][129] overcome these problems since the analysis can be

done at query time, if the data set has not been analyzed before. Nonetheless, for unre-

stricted image retrieval problems, there is a large gap between the objective image features that have been extracted and the semantics of an image.

A necessary prerequisite for features is invariance, no matter whether they are local, global or semantic. Invariant features remain unchanged if the image content is transformed according to a group action, i.e. the features obtained from an unaltered image and from a transformed image are mapped to the same point in feature space. A simple example is the color histogram of an image, which remains identical under any permutation of the image pixels. However, a slight change in illumination may significantly change a simple color histogram.
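As a small, self-contained illustration of this point (not part of the thesis), the following sketch shows that a gray-value histogram is unchanged by a permutation of the pixels but is altered by a simple brightness shift:

```python
# Illustrative sketch: a histogram is invariant to pixel permutations,
# but not to a change in illumination (brightness).
import numpy as np

rng = np.random.default_rng(1)
image = rng.integers(0, 256, size=(64, 64))          # toy gray-value image

hist_original, _ = np.histogram(image, bins=256, range=(0, 256))

permuted = rng.permutation(image.ravel()).reshape(image.shape)
hist_permuted, _ = np.histogram(permuted, bins=256, range=(0, 256))

brighter = np.clip(image + 30, 0, 255)               # simple brightness shift
hist_brighter, _ = np.histogram(brighter, bins=256, range=(0, 256))

print(np.array_equal(hist_original, hist_permuted))  # True: permutation invariant
print(np.array_equal(hist_original, hist_brighter))  # False: illumination changes it
```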

The invariance property might simplify the comparison of images. Clearly, the invariance property averages an object's content in terms of its feature-space representation. Therefore, each invariant feature should be able to describe various transformations of an object, while remaining a unique descriptor. Hence, we will evaluate the invariance properties of the proposed structure-based features and the consequences for retrieving similar images.

1 Note that we do not intend to give a comprehensive survey of CBIR techniques and systems. However, Veltkamp et al. [142] give a good overview of CBIR systems.

2 The authors combine texture and color features for image classification.

Figure 1.1: Content-based image retrieval system chart.

Many state-of-the-art algorithms base their features on low-level properties. Though these features often give good results, they clearly lack higher-level interpretation and knowledge of an image's content. Recent research initiatives have begun to focus on se-

mantic meanings and conceptual representations of image features by incorporating prob-

lem specific techniques [120], structure-based knowledge on levels of different granularity

or scene content understanding [29]. It is inevitable for the further progress of CBIR al-

gorithms to integrate semantics or conceptual descriptors on various levels to sufficiently

represent an image’s content.

The structure-based algorithm investigated in this thesis focuses on image semantics in

order to contribute to the state of the art in image retrieval [49]. Achieving progress in the

area of semantic image retrieval requires solving the so-called semantic gap [29] in parts

or as a whole. There are various definitions for the semantic gap in CBIR. However, the

following definition gives a good abstraction of the problem:

The rift between meaningful descriptors that users expect CBIR applications

to identify with image content, and the features current state of the art CBIR

systems are able to compute.

Indeed, there is a huge "gap" between the human perception and recognition of complex scenes and today's limited features to describe them. Although today's research is far from solving the semantic gap in general, good results have been achieved on special semantics such as equivalence or object [50] classes, e.g. the classification of buildings [13].

In the remainder of this thesis we will show that structure-based knowledge clearly advances image classification and retrieval tasks. Structure-based information can take on many forms, ranging from fine-granular to rough and coarse scales. Structure usually manifests itself in areas of high contrast, leading to visually well-distinguished image areas. The aim of this work is to extract this information and describe it with proper features on local and global scales.

1.3 Related Work

In this section we give a review of related structure-based feature extraction methods in

the area of image classification and content-based image retrieval. Structure-based features

have been frequently used for image representation, classification and content-based im-

age retrieval tasks [58]. Moreover, the authors in [57] have successfully combined texture

with structure and color features. Their findings are not unexpected since the nature of


the features is somewhat complementary to each other, as they focus on different aspects

of an image’s content. In the work of [58], the authors use perceptual grouping features

to represent the structural image content. Their structure-based method extracts a fea-

ture vector containing the number of lines in so-called "L"-, "U"-junctions and significant

parallel groups and polygons. The authors argue that their representation of structure

information serves well for man-made objects. Their quantitative statistical analysis of line segments lacks relative spatial knowledge of distant segments, which would be suitable for describing a complete image scene. Our approach is different in that we encode relative arrangements of line segments obtained by hierarchical clustering. Our structure-based

method encodes the relative position, orientation and distance from a set of line segments

to all others in a histogram representation.
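To give a rough idea of what encoding pairwise segment relations in a histogram can look like, the sketch below bins relative orientations and center distances of line segments. It is a simplified toy illustration only, not the feature set developed in this thesis; the actual features, based on Euclidean distance matrices, are defined in Chapter 4.

```python
# Toy illustration only: histogramming pairwise relations (relative orientation
# and center distance) between line segments. This is NOT the thesis feature.
import numpy as np

def pairwise_relation_histogram(segments, angle_bins=8, dist_bins=8):
    """segments: array of shape (n, 4) holding (x1, y1, x2, y2) per segment."""
    segments = np.asarray(segments, dtype=float)
    centers = 0.5 * (segments[:, :2] + segments[:, 2:])
    angles = np.arctan2(segments[:, 3] - segments[:, 1],
                        segments[:, 2] - segments[:, 0]) % np.pi

    rel_angles, rel_dists = [], []
    n = len(segments)
    for i in range(n):
        for j in range(i + 1, n):
            da = abs(angles[i] - angles[j])
            rel_angles.append(min(da, np.pi - da))        # orientation difference
            rel_dists.append(np.linalg.norm(centers[i] - centers[j]))

    rel_dists = np.asarray(rel_dists)
    rel_dists /= rel_dists.max() + 1e-12                  # crude scale normalization
    hist, _, _ = np.histogram2d(rel_angles, rel_dists,
                                bins=(angle_bins, dist_bins),
                                range=((0, np.pi / 2), (0, 1)))
    return hist.ravel() / max(hist.sum(), 1)              # normalized feature vector

# toy usage with three hypothetical segments
print(pairwise_relation_histogram([(0, 0, 10, 0), (0, 5, 10, 5), (0, 0, 0, 10)]))
```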

The authors in [61] utilized the edge direction histogram to represent the shape of

images. In order to achieve scale invariance, each histogram was normalized with respect

to the number of edge points in the image. In [83], a high-level image feature called

consistent line clusters was introduced, which is a set of lines that is homogeneous in

terms of its characteristics. The algorithm was applied to a building recognition task

with satisfactory results. Another building classification algorithm based on weighted line

segment arrangements was published in [13]. The authors adopted the connectivity of

parallel and perpendicular segments in order to classify buildings. The work of [14] shows

that the clustering of line segments can be used for CBIR applications. The method

described incorporates a hierarchical clustering algorithm with single linkage, where the

authors created a 128-bin feature histogram from the clustering result.

The work of [56] presents a comparison between structure-based and texture-based

CBIR systems, with respect to man-made object retrieval. Perceptual grouping3 features

were used to represent an image’s structure. The results reveal that the usage of a-priori

knowledge of man-made buildings succeeded in a better performance than gray-scale his-

tograms and Gabor texture features.

In [54] a shape histogram is introduced, which is constructed from geometric attributes

and structural information. Line patterns based on line segments are extracted and rep-

resented by a N-nearest neighbor graph. The set of edges extracted from each N-nearest

neighbor graph are used to construct a histogram. The actual matching process was per-

formed with the Bhattacharyya distance on a trademark and logo database. The results

revealed that the method is robust against missing lines and line-splitting, but additional

line segments drastically reduce the performance. In addition, the authors presented a

graph-matching approach which performed better than the histogram method. However,

due to its extremely long matching times it is not applicable to real-time retrieval tasks.

The authors of [91] describe an edge-based feature for shape recognition, where a model

3A detailed description of the perceptual grouping features may be found in [58].


of an object is learnt and used for recognition of the same object in different images. In [152]

another edge-map based feature was proposed for CBIR applications. The actual feature

is computed by applying a waterfilling algorithm to the initial edge-map, which results

in a structure representation of an image. The results show that edge-based features are

useful and can improve CBIR results. An edge-based shape feature was presented in [4]

[75], where contour edge point relations are encoded representing the entire object shape.

The feature performed well on recognition tasks within object databases such as COIL-100

(household objects photographed on a turntable in 5-degree increments and with constant

background) and MNIST (dataset of handwritten digits).

The article in [121] presents a structure-based feature named image context vector. The

features are computed from line segments, where short segments were omitted. In fact, only

four orientations were used for the angular segment representation. The authors observed robustness with respect to scaling, translation and noise. However, the approach suffers from not being invariant to rotation and illumination changes.

In [33] the authors present a model-based approach to 3D building extraction from

aerial images. The model consists of a hierarchical set of parts, representing subparts of

a building. The work of [37] approached the problem of building extraction from aerial

imagery by incorporating geometric moments. The method extracts regions containing

buildings and fits a polygon to the area of interest. However, the actual features are

geometric moments computed from the closed contour of the fitted polygon.

The review of the literature shows that an image’s structure can be used to extract

salient and powerful features for CBIR applications.


Chapter 2

Edge Detection

2.1 Introduction

In this chapter we will discuss the problem of edge detection, which is regarded as a subject of fundamental importance in image processing. During the last decades the image processing community has spent huge efforts on the task of edge finding. The usage of edge finding methods is mainly driven by the need to reveal structural information. Usually, edge detection is one of the initial steps of an image processing task. That is also the case for our problem, the structure-based feature extraction. The edge maps serve as a kind of ground truth data for the subsequent feature extraction procedure. Therefore, we have to take great care in the choice and usage of the proper edge extraction technique.

The question of "what can be regarded as an edge" must be answered before we can derive edge detection methods. In fact, one can regard edges as locations of great importance or saliency within an image. This also makes sense from the human physiological

point of view. The human perception of objects and scenes is to a large extent based on

particular spatial configurations and changes in intensity [150].

We can define an edge as a strong intensity contrast or a jump in an image’s intensity

within a limited spatial range, i.e. the intensity changes from one image pixel to the next.

Figure 2.1 shows the three most common edge profiles. Formally, we can define a step edge

profile function as follows:

Vstep =

b if x0 − f2

< x < x0

b + m if x0 ≥ x < x0 + f2,

(2.1)

with b as the background grey value and m as the change of the gray value. The edge is

localized at x0 and f the is size of the window. In Figure 2.1(a) we can see an ideal step

edge. Although the ideal case of an edge is easy to describe it almost never will occur in a


Figure 2.1: Line profiles: (a) ideal step edge, (b) ramp edge, (c) line profile.

real image. Usually, objects do not have perfectly sharp boundaries as Equation 2.1 suggests.

Additionally, it is almost impossible that an image scene is sampled in such a way that the

edges are exactly located at the margins of pixels. Physical aspects such as temperature,

motion, thermal noise from the camera and others, superimpose a noise component on the

image. Therefore, edges are often represented by a ramp as shown in Figure 2.1(b).
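For illustration only (not taken from the thesis), the following short numpy sketch generates an ideal step edge, a smoothed and noisy version of it, and their first and second differences, mirroring the profiles sketched in Figures 2.1 and 2.3:

```python
# Illustrative sketch of the profiles in Figures 2.1 and 2.3: an ideal step
# edge, a smoothed/noisy ramp-like edge, and their first and second differences.
import numpy as np

x = np.arange(100)
b, m, x0 = 10.0, 50.0, 50            # background value, step height, edge position

step = np.where(x < x0, b, b + m)    # ideal step edge (illustrating Equation 2.1)

kernel = np.exp(-0.5 * (np.arange(-5, 6) / 2.0) ** 2)
kernel /= kernel.sum()
ramp = np.convolve(step, kernel, mode="same")             # smoothed -> ramp edge
noisy = ramp + np.random.default_rng(2).normal(0, 2, x.size)

first_derivative = np.diff(noisy)                          # peak near the edge
second_derivative = np.diff(noisy, n=2)                    # zero crossing near the edge

print(int(np.argmax(np.abs(first_derivative))))            # approximately x0
```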

Once the basic properties of edges are defined, various edge extraction methods can be derived. To date, numerous edge detection algorithms have been proposed. It is well beyond the scope of this thesis to give a complete review of all edge detection

algorithms. Some of the best-known methods are Roberts, Sobel, Robinson [116], Kirsch [65], Canny [18], Prewitt [107], Marr-Hildreth [89] or the Laplacian of Gaussian (LoG) operator. However, good overviews, reviews and evaluations of various techniques can be

found in [153], [148] and in [8]. The selection of the proper edge operator is not obvious.

In practice, the choice of an edge detector is not always driven by accurate performance

evaluation, but rather by intuitive or empirical knowledge. Yet the question remains, "which edge detection algorithm performs best?" Figure 2.2 shows an exemplary comparison of five different edge detectors (LoG, Prewitt, Roberts, Sobel and Canny), where

the Canny algorithm visually produces the best result1. In fact, the Canny detector is

widely used for various structure- or shape-based feature extraction methods. A useful edge extraction and evaluation procedure was presented in [47]. In the experimental part, five well-performing edge detection algorithms (Canny, Nalwa, Iverson, Bergholm, Rothwell) were applied. The authors took great care in evaluating all methods under different

parameter settings. Independent judges visually verified the quality of edge maps obtained

from various parameter sets with all algorithms. It turned out that with adapted parameters the Canny edge detector performed best. The authors of [126] compared seven edge detectors on the task of structure from motion. Their experiments have shown

that the Canny detector performs best and additionally it is one of the fastest algorithms.

A similar conclusion was drawn by [8]. Therefore, we use the Canny method as the first

1 Note that we will show later in this chapter how to objectively determine the quality of an edge map.


step in our structure-based feature extraction procedure. Moreover we will show how to

automatically determine the best set of parameters for the Canny edge detector. In the

next section we review the basics of the Canny edge detector in order to better understand

the parameter determination procedure.

2.2 Canny Edge Detector

The Canny edge detector is known to be a robust, well-performing algorithm [18]. More precisely, the Canny edge detector is optimal for step edges corrupted by a Gaussian noise process.

The aims of the Canny algorithm were clearly stated in the original work [18] in the form of optimality criteria:

• Good detection criterion: Important edges should be detected and spurious responses

should be omitted

• Good localization criterion: The distance between the real and the located edge

position should be minimal

• Clear response criterion: Multiple responses to a single edge should be avoided

The first step of the Canny edge detector is to smooth the actual input image. This

has the effect of slightly blurring the image - depending on the size of the Gaussian. The

smoothing is accomplished by convolving the raw 2-D image g(x, y) ∈ R2, with a 2-D

Gaussian G(r) in polar coordinates:

G(r) = \frac{1}{\sqrt{2\pi}\,\sigma^2} \, e^{-\frac{r^2}{2\sigma^2}}, \qquad (2.2)

with r = \sqrt{x^2 + y^2} representing the radial distance from the origin. In order to extract edges we need to form the first and second derivatives of the Gaussian, which are

G'(r) = -\frac{r}{\sqrt{2\pi}\,\sigma^4} \, e^{-\frac{r^2}{2\sigma^2}}, \qquad (2.3)

and

G''(r) = -\frac{1}{\sqrt{2\pi}\,\sigma^4} \left[ 1 - \frac{r^2}{\sigma^2} \right] e^{-\frac{r^2}{2\sigma^2}}. \qquad (2.4)
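Equations 2.2 to 2.4, as reconstructed above, can also be written down numerically; the following small numpy sketch is only an illustration of these formulas, not code from the thesis:

```python
# Sketch of Equations 2.2-2.4: the radial Gaussian and its first and second
# derivatives with respect to r, as used for smoothing and edge extraction.
import numpy as np

def gaussian(r, sigma):
    return np.exp(-r**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma**2)

def gaussian_d1(r, sigma):
    return -r * np.exp(-r**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma**4)

def gaussian_d2(r, sigma):
    return -(1 - r**2 / sigma**2) * np.exp(-r**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma**4)

r = np.linspace(-5, 5, 11)
print(gaussian(r, 1.0), gaussian_d1(r, 1.0), gaussian_d2(r, 1.0))
```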

For the second step we use the smoothed version of the image g and convolve it with

the operator Gn, which is the first derivative of G(r) in the direction n.

\frac{\partial^2}{\partial n^2}(G * g) = \frac{\partial^2 G}{\partial n^2} * g = 0, \qquad (2.5)


Figure 2.2: Comparison of five edge detectors (Laplacian of Gaussian, Prewitt, Roberts, Sobel and Canny) with default parameters. Visually, the Canny method performs best.


Figure 2.3: Typical examples of edge profiles and their derivatives: (a) smoothed step edge, (b) smoothed step edge with noise, (c, d) first derivative of a smoothed step edge without and with noise, (e, f) second derivative of a smoothed step edge without and with noise.


where g denotes the image and Gn is defined as follows

G_n = \frac{\partial G}{\partial n} = n \, \nabla G, \qquad (2.6)

with n as the edge normal (gradient direction), defined as n = \frac{\nabla(G * g)}{|\nabla(G * g)|}. In words, Equation 2.6

defines how to find local maxima in the direction perpendicular to an edge. In the literature

this operation is known as non-maximal suppression, i.e. local maxima can be found where

peaks in the gradient function occur. The non-maxima perpendicular to the edge direction

are suppressed, since the edge strength along the edge contour is mostly continuous. Thus the method ensures a maximal signal-to-noise ratio (SNR) and a maximal localization of the edge operator. This approach works well for most kinds of edges but fails at corner locations. Therefore, the Canny method is not well suited for corner detection tasks.
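A compact illustration of non-maximal suppression as described above is sketched below (a deliberately simple four-sector version in pure numpy, not the thesis implementation); the gradients gx and gy would in practice come from the Gaussian-derivative filtering described earlier in this section:

```python
# Sketch of non-maximal suppression: keep a pixel only if its gradient magnitude
# is a local maximum along the (quantized) gradient direction.
import numpy as np

def non_max_suppression(mag, gx, gy):
    h, w = mag.shape
    out = np.zeros_like(mag)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            a = angle[y, x]
            if a < 22.5 or a >= 157.5:      # ~horizontal gradient: compare left/right
                n1, n2 = mag[y, x - 1], mag[y, x + 1]
            elif a < 67.5:                   # ~45 degree gradient
                n1, n2 = mag[y - 1, x + 1], mag[y + 1, x - 1]
            elif a < 112.5:                  # ~vertical gradient: compare up/down
                n1, n2 = mag[y - 1, x], mag[y + 1, x]
            else:                            # ~135 degree gradient
                n1, n2 = mag[y - 1, x - 1], mag[y + 1, x + 1]
            if mag[y, x] >= n1 and mag[y, x] >= n2:
                out[y, x] = mag[y, x]
    return out

# toy usage: gradients of a synthetic image via finite differences
img = np.zeros((32, 32)); img[:, 16:] = 1.0
gy, gx = np.gradient(img)
thin = non_max_suppression(np.hypot(gx, gy), gx, gy)
```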

Setting the final threshold in order to identify significant edges and omit others is a common problem in edge detection. One manifestation of this problem is so-called streaking, which refers to the appearance of broken lines due to edge pixels lying below and above a fixed threshold. For our investigation on structure-based features, which describe local and global arrangements, streaking can be a serious problem. As we will explain in Chapter 4, the edge map is of great importance for the computation of meaningful structure-based descriptors. The lower and upper hysteresis thresholds (θl, θu) of the Canny algorithm ensure that variations of the edge value between the lower and upper threshold do not break up contours, so that the likelihood of streaking is greatly reduced. In the next section we will discuss how to automatically select the three parameters for the

Canny algorithm based upon the evaluation of ground truth data.

2.3 Performance Evaluation and Parameter Selection

Though the Canny detector produces good results in general, it is not obvious how to select

the parameters. In fact, an automatic determination or selection is most desirable. Various

researchers tried to come up with evaluation procedures that can be roughly classified

into evaluation methods based on ground truth and evaluation methods without ground

truth. Evaluation methods without ground truth solely rely on human judgement (visual

inspection) of an edge detector's result. However, even for human beings it is difficult to evaluate the output of edge detectors, because the perception of coarser or finer structures varies strongly in a complex scene [47]. The authors have shown for a set of real

world images how human individuals rate the output of various edge detectors visually.

The test persons were asked to find the best edge images for each detector from a pool of

different parameter settings. Although the analysis revealed that many of the test persons

agreed on one particular result as being the best for some images, this was not true in


general. The subjectivity of rating which edge map is the ideal one is evident, though it depends to a large extent on an image's content. Further, the results have shown that the clearer a scene appeared in the image, the lower the variance in the ratings was. Therefore, we conclude that highly structured backgrounds, e.g. in images of nature scenes like forests, natural grass areas or stony grounds, lead to very different ground truths and evaluation results. In fact, it is nearly impossible to accurately create a

ground truth or rely on a human evaluation of such images. Even if a visual inspection of

edge images results in the selection of the best parameter set it is not applicable to larger

amounts of images or automatic processing methodologies.

Evaluation methods based on ground truth data are capable of automatically selecting parameters. We agree with the authors in [144] that an evaluation of edge detection methods with ground truth data is inevitable and can help to determine optimal parameter sets for any kind of detector. The authors in [8] have evaluated the performance of edge detectors using ground truth data. In detail, an edge is counted as a true positive if it was detected within a specified region and as a false positive if it wasn't. Various tests

have shown that the Canny and Heitger edge detectors performed best. Other automatic

and semi-automatic evaluation procedures based on ground truth data are presented in

[68] and [149].

However, the subject of automatic parameter selection remains highly subjective. The

experiments we have conducted largely agree with observations of other groups (e.g. [47])

- that even manually generated ground truths strongly depend on the individual person.

Despite of this evident evaluation problem it is necessary to develop algorithms for the

automatic parameter selection of edge detectors. We have to clearly state that our results

of automatically generated parameters only hold for the currently used image database.

Other sets might need different parameters in order to obtain good visual results.

Next we will present an automatic parameter selection procedure and apply it to the

Canny edge detector. Therefore, we define a simple yet effective measure of an edge image’s

quality. For this purpose we use sample images from the University of South Florida (USF)

image database [47] that is provided together with a manually created ground truth. In

practice the labelling process turns out to be an extreme tedious work and for very complex

scenes such as forest or mountains it is nearly impossible to correctly label all true edges.

Although it is not possible to exactly determine the perfect result for an arbitrary image

one can try to limit the detector’s parameter space to a meaningful range.

As shown earlier in this section the Canny operator can perform reasonably well in

most cases, but the results may still heavily vary depending on the parameters as shown

in Figure 2.4. So we want to determine the best possible combination of the three Canny

parameters θl, θu and σ. Hence, we densely sample the parameter range of 0 ≤ θl, θu < 1

and 0 < σ ≤ 5, by increments of 0.1, where θl is multiplied by factor 0.4 in order to obtain

Page 34: Structure Features for Content-Based Image Retrieval and

16 Edge Detection

Table 2.1: Number of ground truth pixels for sample images.

Image # of GT pixels (NGTp )

207 (Car) 874843 (Telephone) 17870

101 (Electric Iron) 8833132 (Kitchen Tools) 6701

a gradual interval. Subsequently, we form combinations of θl, θu and σ resulting in 550

different sets of parameters sm = 1, 2, ..., 550. In the next step we compute the edge

map from all images and for every set of parameters. Once an edge image is obtained, the

result has to be compared with the ground truth as follows: Let pi, with i = 1, 2, ..., N,be a pixel value, where N is the number of pixels of an edge image that are in the range

of pi = 0, 1, for a binary or pi = 0, 255, for a gray scale edge map2. Then we can find

the best set of parameters Npsm

for an image that minimizes the following expression.

Npsm

= argminm=1,2,...,550

||NGTp − N sm

p ||, with (2.7)

N sm

p =

I∑

i=1

pi 6= 0;

and NGTp is the number of edge pixels different from zero for the ground truth image (see

Table 2.1). Figure 2.4 shows, from left to right, the ground truth, the best and the worst

set of parameters for several images. A visual comparison of the samples shows that the

automatically selected parameters result in edge maps of similar quality compared with

the ground truths. Moreover, the bad examples are clearly to identify.

2Similarly we can define pi for color images.

Page 35: Structure Features for Content-Based Image Retrieval and

2.3 Performance Evaluation and Parameter Selection 17

Figure 2.4: Edge maps for the Canny edge detector with various sets of parameters. The leftmost columnshows gray scale images. The second column resembles the ground truth images. The third column displaysthe best edge maps we could have automatically obtained. The parameters are listed in the first columnof Table 2.3. The last column depicts sample edge maps obtained by less performant Canny parameters,that are typically in the order of: σ ≥ 3.5 and [θl, θu] approaching 1.

Page 36: Structure Features for Content-Based Image Retrieval and

18

Edge

Dete

ctio

n

Table 2.2: Falsely detected edge pixels for the ten best sets of parameters [in %]. The two right most columns indicate themean error for the best ten and best five sets, respectively. The second part of the table lists the absolute numbers of falselydetected edge pixels.

Measures [%]Image

1 2 3 4 5 6 7 8 9 10x10 [%] x5 [%]

207 1.3832 1.5318 4.0924 5.5670 5.9671 5.9785 9.3621 9.8537 10.4138 11.2140 5.97 4.0943 0.7359 1.6076 7.9928 9.1475 9.7702 12.1137 17.5818 19.9819 23.7632 24.2839 10.94 7.99101 0.0056 0.0839 0.2294 0.3302 0.5484 0.5484 0.5932 0.8226 1.0632 1.2815 0.55 0.23132 0.0149 1.3282 1.4923 1.8803 2.5220 2.8205 3.1786 3.7308 4.2382 4.6560 2.67 1.49

Measures [px]Image

1 2 3 4 5 6 7 8 9 10x10 [px] x5 [px]

207 121 134 358 487 522 523 819 862 911 981 522 35843 65 142 706 808 863 1070 1553 1765 2099 2145 966 706101 1 15 41 59 98 98 106 147 190 229 98 41132 1 89 100 126 169 189 213 250 284 312 179 100

Page 37: Structure Features for Content-Based Image Retrieval and

2.3 Performance Evaluation and Parameter Selection 19

Table 2.3: The parameters [θl, θu, σ], for the ten best edge maps. At the end of the tablewe list the averaged values.

Measures [%]Image 1 2 3 4 5 6 7 8 9 10

θl 0.04 0.04 0.04 0.04 0.04 0.08 0.04 0.04 0.12 0.08207 θu 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.3 0.2

σ 2.5 2.6 2.7 2.3 2.4 0.9 2.2 2.8 0.6 1.0

θl 0 0.04 0.04 0 0 0.04 0 0.04 0 0.04101 θu 0.01 0.1 0.1 0.01 0.01 0.1 0.01 0.1 0.01 0.1

σ 2.2 1.1 1.0 2.1 2.3 1.2 2.4 0.9 2.5 1.3

θl 0 0.12 0.08 0.12 0.04 0 0.12 0 0.12 0.0443 θu 0.01 0.3 0.2 0.3 0.1 0.01 0.3 0.01 0.3 0.1

σ 4.8 0.7 1.1 0.4 2.1 4.9 0.6 4.7 0.5 2.2

θl 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.12132 θu 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3

σ 0.4 4.0 3.9 3.8 4.1 3.7 3.5 3.6 4.2 1.8

θl 0.03 0.07 0.06 0.06 0.04 0.05 0.06 0.04 0.08 0.07θu 0.08 0.18 0.15 0.15 0.10 0.13 0.15 0.1 0.2 0.18σ 2.48 2.1 2.18 2.15 2.73 2.68 2.18 3.0 1.95 1.58

Table 2.2 lists the percentage and the absolute number of falsely detected edge pixels

for the ten best sets of parameters. The average, x10, of falsely detected edge pixels for the

ten best parameter sets is in the order of a few percent and significantly less when we only

consider the best five results, x5. The best result is obtained for image 101 (Electric Iron)

with an x10 of about 0.5%. Moreover, the table shows that there is no correlation between

the absolute number of ground truth edge pixels and false detection. That suggests that

not only the number of edge pixels but also the complexity, i.e. semantic content of an

image heavily influence an edge detector’s performance. Table 2.3 lists the corresponding

ten parameter sets.

The best results are obtained for similar parameters. Thus, it is possible to define a

range of good parameters. Since the best parameter sets have a small variance we compute

the average of the five best results for all images. Note, that we only consider the best five

sets in order to ensure a compact range for the three parameters.

Θl =1

N

N=5∑

i=1

θli, Θu =

1

N

N=5∑

i=1

θui , Σ =

1

N

N=5∑

i=1

σi, (2.8)

with θli, θu

i and σi being the average over the best five sets. Finally we arrive at the best set

Page 38: Structure Features for Content-Based Image Retrieval and

20 Edge Detection

of parameters for the Canny detector that are: Θl = 0.052; Θu = 0.132 and Σ = 2.33. In

fact, we suggest the usage of the following ranges 0.045 ≤ Θl ≤ 0.065; 0.13 ≤ Θu ≤ 0.155

and 2.2 ≤ Σ ≤ 2.4.

2.3.1 Summary

In this section we have shown that the Canny edge detector produces high quality edge

maps if the proper parameters can be determined. For that purpose we have presented a

method that automatically identifies the best set of parameters for real world image scenes.

By evaluating 550 different combinations of parameters per image we could designate the

best set through the comparison with a manually created ground truth. The result shows

that the best set of parameters only produce an error (between an edge map an the ground

truth image) of 0.1 to 1.3%. The visual appearance of the resulting edge maps is not

very different from the ground truth and thus is a proof for the accuracy of the parameter

determination. Finally, we have given a range for the three Canny parameters that should

lead to high quality edge maps.

As a future work, it would be of interest to investigate how parameters may be affected

for the detection of salient edge points, i.e. to distinguish between important and less

important edge pixels.

Page 39: Structure Features for Content-Based Image Retrieval and

2.4 Line-Segment Extraction 21

2.4 Line-Segment Extraction

Line segment detection is a long standing problem in computer vision with numerous

applications. Although there have been advancements during the past decades [42] the

problem remains unsolved in general. Line segments can be understood as primitive geo-

metric objects that may be used for high level feature computations. Applications can be

found in various domains such as object detection/recognition [13], [17], stereo imaging,

face recognition [36], content-based image retrieval [83], segmentation, object contour en-

coding, image descriptors, object tracking, sketch-, logo-, and watermark-matching. For

imagery applications such as the localization of buildings [85] or road networks it is of

essential importance to accurately detect line segments, too. In [106] an automated road

segment extraction system is presented, where intrinsic properties of roads are incorpo-

rated. The method assumes each road segment being a part of a sequence of connected

road objects. These road objects must obey a set of rules, which reflect empirical knowl-

edge of the appearance of road networks in aerial images. Subsequently, the road objects

are grouped into road segments. The algorithm is tested with two high resolution images,

where about 80% of the road network was detected.

In [137] a method named sequential fuzzy cluster extraction is presented. The algorithm

extracts single clusters by maximizing the sum of all similarities among members. Thus,

weak collinear line segments are clustered as one single segment. The approach was applied

to a hand drawn sketch image in order to obtain a more compact representation. Though

the results are interesting, more comprehensive experiments (i.e. with more sketch images)

are missing.

In [99] a method is presented in order to detect and estimate straight patterns in videos.

The method involves the Radon transform and a whitening filter as a likelihood processor

which in turn determines a statistic for a global signal detection. The authors applied

the proposed algorithm to videos of vehicles obtained from highway gates, where they

estimated the motion. Thus, this approach detects straight patterns in a spatio-temporal

context rather then in single images.

The authors in [76] applied line detection for scene modelling by focusing on horizontal

and vertical lines. In detail, the method is block-based, where for every block a local

accumulator is created which is checked for the presence of line segments. In order to

apply the method five parameters have to be adjusted, where the authors do not fully

describe how they are determined. Although the authors claim a lower computational

time then for the Hough transform, no run-time comparison is presented. The method

is applied to few test images but the analysis lacks of a comprehensive evaluation and

parameter discussion.

In the sequel we will shortly review the standard Hough-transform and present an edge

tracking algorithm for the task of straight line segment extraction.

Page 40: Structure Features for Content-Based Image Retrieval and

22 Edge Detection

0 20 40 60 80 100 1200

50

100

150

200

250

300

X−axis

Y−

axis

θ

CartesianSpace

r

Hough transform

θ

r

−80 −60 −40 −20 0 20 40 60 80

−600

−400

−200

0

200

400

600

Figure 2.5: Left panel shows a straight line in the Cartesian space. In addition, r and θ areplotted to show the parameterization of the line. Right panel the Hough transformationof the single line from the left. Note, that the pattern originates from the quantization inthe Hough space.

2.5 Hough Transformation

The Hough transformation [52] is an image analysis method which is commonly used in

order to detect straight lines, circles, polynomials or others in images. The Hough transform

for detecting straight lines in an image will be shortly reviewed in the following. A line

segment is parameterized by two variables r and θ as follows:

r = xi cos(θ) + yi sin(θ), (2.9)

with i = 1, 2, ..., N, and N the number of collinear points and r is the perpendicular

distance from the origin of the coordinate system and θ is the angle of r with respect

to the X-axis. The Hough transform actually maps collinear points from the Cartesian

space into sinusoidal curves in the Hough space. In detail, each collinear point (xi, yi) is

transformed into a sinusoidal curve (parameterized according to Eq. 2.9) in the Hough

space, where all curves intersect in the point (r, θ). In the Hough space the intersections

can be visualized easily, as shown in Figure 2.5. Usually, the Hough space is quantized

into accumulator cells (finite intervals), where each point (xi, yi) votes in the (r, θ) space

for all possible lines passing through it. In other words,the more votes a position in the

accumulator matrix obtains, the more feature points lay exactly on this line.

However, in order to extract straight line segments in the original image the (r, θ) inter-

section peaks have to be extracted. A common way is to locate local maxima by introducing

a threshold in the accumulator array. The final step is the so-called de-Houghing, which is,

the mapping from the Hough space back into the Cartesian space. Thus, we finally obtain

Page 41: Structure Features for Content-Based Image Retrieval and

2.6 Edge Point Tracking 23

a set of points (xi, yi) describing straight line segments in the original image.

2.6 Edge Point Tracking

In order to extract straight line segments we have adopted the algorithms of [105] and

[69]. In the first step we create an edge map with the Canny detector according to the

findings of Chapter 2. Next, the actual algorithm begins with an initial scan through the

binary edge map, where the neighborhood of every edge pixel is investigated. We have

decided for a binary representation for reasons of complexity. Intensity edge maps would

strongly increase the computational time per image. Let pi ∈ 0, 1 be a binary edge

pixel, with i = 1, 2, ..., N, and N is the number of image pixels. In detail, each detected

edge pixel pj = 1, with j = 1, 2, ...N e, where N e denotes the number of edge pixels

serves as a starting point in order to track neighbor pixels, that are eight connected. The

tracking proceeds as long as no more neighbor pixel can be detected. Then the procedure

starts again with the same initial pixel pj, but into the other direction. Thus, we obtain

a list of edge pixels listj , where all members are assigned a label. This iterative tracking

is performed for all edge map pixels pj resulting in j lists. Subsequently, for every listjthe beginning pb

j and end point pixels pej are determined in order to from the line segment

lj . Next, a second parameter is introduced which controls the threshold of the maximum

allowed line tolerance, i.e. pixels that are too far off the line segment lj (line connecting the

first and the last position in the edge pixel list listj). This step is repeated as long as the

actual distance deviation falls below the tolerance threshold. Once all edge pixel lists are

tested for the tolerance parameter, we obtain a set of line segments L = lj |j = 1, 2, ..., N e,with lj = [pb

j pej ], and pb

j , pej are the start and end points of a segment. Occasionally it

happens that some segments are falsely tracked. In order to compensate this effect we

perform a final procedure that (re-)connects broken segments.

Basically, this procedure performs quite similar to the initial pixel tracking algorithm.

Therefore we take the list of line segments L in order to check for segments that lie within

a specified distance and angle tolerance. Thus two segments lk and ln must simultaneously

fulfill the following constraints

θ(lk, ln) ≤ T ang, (2.10)

dist(lk, ln) ≤ T dist, (2.11)

with θ(lk, ln) as the angle between two segments and dist(lk, ln) stands for the Euclidean

distance between two lines. T ang denotes the angle tolerance and T dist the distance thresh-

old. If both conditions are met at the same time then two segments are merged into one.

The practical experience shows that angle deviations between two segments of less then

Page 42: Structure Features for Content-Based Image Retrieval and

24 Edge Detection

five degrees are meaningful. For the distance tolerance a few pixels (one to eight depending

on an image’s resolution and size) are sufficient in order to obtain good results.

Evaluation: Next, we want to apply the Hough-transform and the edge-point tracking

algorithm to a real world image. In Figure 2.6 four images are displayed where we have

overplotted the obtained line segments. The left images show segments obtained with

the Hough-transform and the lines on the right images were detected with the edge point

tracking algorithm. We can see that for the building image the Hough-transform performs

poorly in comparison to the edge tracking algorithm. The Hough-transform3 recognized

only about one third of the segments detected by the edge point tracking method. In fact,

it is known that the Hough-transform can produce misleading or false results in the case

of objects that happen to be aligned. The Hough-transform result for the second image

looks at the first sight more promising. A more detail look reveals that some of the most

salient contour lines of the puma (back and tail) are not detected.

The bad results of the Hough-transform may be due to quantization issues. It is com-

mon to describe the mapping from the Cartesian to the Hough space as a two-dimensional

histogram, where the number of bins turns out to be a crucial factor. A too rough reso-

lution, i.e. too few bins, will most likely map spatially close and almost parallel lines to

the same bin. On the other hand, a very fine resolution will lead to a mapping of collinear

points into separated bins and not into one accumulation point, as theoretically assumed.

In the literature one can find various enhanced Hough-transformation algorithms that

might work better then the standard method, such as for example [35] or [96].

However, our observations clearly show the superior performance of the edge tracking

algorithm over the Hough-transform. Therefore, it is our preferred choice for the extraction

of straight line segments. In the next chapter we will use the edge point tracking in order

to extract straight line segments that will serve as input for the hierarchical clustering

algorithm.

3The presented results were obtained with the Matlab implementation of the Hough-transform withdefault parameters.

Page 43: Structure Features for Content-Based Image Retrieval and

2.6 Edge Point Tracking 25

Line Segments from Hough Transform Line Segments from Edge Point Tracking

Line Segments from Hough Transform Line Segments from Edge Point Tracking

Figure 2.6: Comparison of line segments obtained from the Hough-transform and the edgepoint tracking algorithm.

Page 44: Structure Features for Content-Based Image Retrieval and

26 Edge Detection

Page 45: Structure Features for Content-Based Image Retrieval and

Chapter 3

Clustering

Cluster analysis is a method of multi-variate statistics to reveal homogeneous groups of

objects or training patterns. Since no a priori knowledge, such as labels of data patterns is

available clustering techniques belong to unsupervised classification methods. One could

roughly describe clustering as the process of organizing patterns into meaningful entities,

who’s members are somewhat similar to each other. Clustering algorithms try to discover

similarities and differences among data patterns and to derive meaningful conclusions about

them. Indeed, the paradigm of clustering is met in many, if not in most scientific disci-

plines, where any kind of data patterns occur. The most important applications reach from

medical sciences, social sciences, physical sciences, life sciences (biology, zoology) to engi-

neering and computer sciences. It would go beyond the scope of this thesis to describe all

clustering techniques in detail. The interested reader is refereed to the following literature

on clustering techniques: [5], [147], [63] and [60].

In this work we are using hierarchical clustering methods that are described in this chap-

ter. In order to classify hierarchical clustering within the zoo of clustering techniques, the

major methods are sometimes divided into sequential algorithms, hierarchical algorithms

and algorithms based on cost function optimization [136]. Whereas, from the algorithmic

point of view we can distinguish the following methods1:

• Deterministic clustering methods: Each data object is assigned to exactly one cluster,

such that the clustering defines a partition of the data

• Function-based clustering: Function-based methods decide the data class member-

ship on an assignment function that has to be optimized. Thus, function-based

clustering methods are optimization problems.

• Possibilistic clustering methods: Each data object gets a possibility value assigned,

1Note, that this taxonomy of clustering algorithms is methodically and is not the only possible one.Moreover, [31] and [63] give a more detailed overview of possible taxonomies of clustering methods.

Page 46: Structure Features for Content-Based Image Retrieval and

28 Clustering

that determines if the object belongs to a certain cluster. Possibilistic clustering

methods are often referred to as pure fuzzy clustering. where the sum of all possibil-

ities for one data object belonging to a cluster has not to equal one.

• Probabilistic clustering methods: For each data object a probability distribution is

determined, which defines the probability of any data object belonging to a certain

cluster.

• Hierarchical clustering methods: The data to cluster are subdivided into several steps

in finer and finer or in coarser and coarser groups. The process of merging groups is

done by so-called linkage.

3.1 Hierarchical Clustering

Hierarchical clustering algorithms represent a given data-set as a sequence of partitions,

where each partition is nested into the next higher partition in the hierarchy. Formally,

there are two hierarchical algorithms, namely the agglomerative and the divisive one. The

agglomerative algorithm is a bottom-up technique, where initially all data patterns belong

to disjoint clusters, which are subsequently merged into larger and larger clusters until only

one cluster remains. The divisive algorithm on the other hand is a top-down technique,

that first starts with all data patterns in one cluster and consequently is subdivided into

smaller clusters. Both algorithms end up with an hierarchical representation of the data

- the so-called dendrogram (tree structure), where each step in the clustering process is

illustrated by a join of the tree. Figure 3.1 shows a flow-chart of the hierarchical clustering

algorithm.

For our problem, the grouping of line segments, we only consider the agglomerative

hierarchical clustering algorithm (AHC). The reason for that is the higher complexity of

the divisive method which makes it practically unusable for the present application. In

addition, AHC is our preferred choice over other clustering techniques, such as Linde-Buzo-

Gray (LBG) or k-means algorithm, where the number of clusters has to be known at the

onset.

Usually, the number of clusters is unknown in advance and additionally may vary for

each image. Hence, we have decided for hierarchical clustering. Moreover, the AHC omits

the cluster-center initialization problem, which has a crucial impact on the performance of

the k-means algorithms.

Clustering or grouping of line segments is a long standing problem of computer vision

and was already studied by [40], where the authors used the segment lengths for the clus-

tering process. A more recent work was presented in [62], where the grouping is done

according to ratios of segment lengths and their distances. In an iterative process a hier-

Page 47: Structure Features for Content-Based Image Retrieval and

3.1 Hierarchical Clustering 29

Figure 3.1: Flow chart of the agglomerative hierarchical clustering algorithm.

Page 48: Structure Features for Content-Based Image Retrieval and

30 Clustering

archy of line segment clusters is produced. Unfortunately, the authors have applied their

method to just one image. A different approach is presented in [108]: the line segments are

detected by a Hough-transform in subwindows. Subsequently, the segments are merged

across different windows, where the process is highly sensitive to the window size. The

authors in [117] performed a grouping of line segments based on Eigenclustering. More

related work is mentioned in Section 1.3.

In detail, the agglomerative hierarchical clustering is a special clustering technique that

groups data over a variety of scales. The clusters are represented in a multi-level hierarchy,

or dendrogram, where clusters at one level are joined to form clusters at the next higher

level. AHC produces a number of partitions CnNn=1 with N denoting the number of

clusters. AHC initially consists of N clusters, each filled with a single element of the input

vector X = xn; n = 1, ..., N. At each of the N subsequent clustering steps, two clusters,

that are closest to each other, are joined together.

The following algorithm summarizes the steps of the general agglomerative hierarchical

clustering schema (sometimes called generalized agglomerative scheme (GAS)).

Algorithm Agglomerative Hierarchical Clustering

1. p = 0 Partition level

2. Initialize N singleton clusters:

3. Select the initial clusters: Tp=0 = Cn = xn, n = 1, 2, ..., N4. p = p + 1

5. R = Np(Np−1)2

All possible pairs of clusters for the current clustering Tp

6. for all R

7. D(Ci, Cj) = min1≤k,l,≤RD(Ck, Cl)

8. Create a new cluster from Ci and Cj:

9. Cs = Ci

Cj

10. Update the clustering Tp

11. Repeat Steps 4 to 10 until all vectors lie in a single cluster

It is easy to notice that the agglomerative hierarchical clustering forms a nested hierarchy

Tp=0 ⊂ T1, ...,⊂ TN−1, where p are the partition levels for each clustering step.

The next section is dedicated to on how we actually compute the distance between

cluster pairs in order to form the hierarchy.

3.2 Linkage

The formation of a cluster hierarchy can be seen as the iterative application of a dissimi-

larity function to a set of possible pairs of clusters of a data matrix X. The step of merging

clusters into larger and larger clusters until only one cluster remains is called linkage. More

Page 49: Structure Features for Content-Based Image Retrieval and

3.2 Linkage 31

specifically, linkage is the criterion by which the clustering algorithm determines the ac-

tual distance between two clusters by defining single points that are associated with the

clusters in question. Before we define several linkage methods it is necessary to introduce

the dissimilarity matrix in order to compute distances between clusters.

Dissimilarity Matrix It is essential for hierarchical cluster analysis to measure the

closeness of two objects or a pair of clusters, which is done by the so called dissimilarity

matrix (or proximity matrix). It is a symmetric square N ×N matrix D with dij elements.

Each element represents a measure of distinction between the i-th and j-th object or vector.

The matrix features zero diagonal values stating that the self-distance is zero. We can define

a dissimilarity measure d for a data set X as a function: d : X × X → r ∈ R+, with r as

the set of real numbers. Note that d features the same properties as the cophenetic matrix

(see Equation 3.10).

Linkage Methods Subsequently, we will define the methods for our experiments that

are single, complete, average, centroid, median and ward linkage.

Single linkage defines the distance between any two clusters as the minimum distance

between them, i.e. the distance between the two closest points

d(k, l) = min(dist(xki, ylj)); k ∈ (1, ..., ni), l ∈ (1, ..., nj), (3.1)

where ni and nj are the number of objects in cluster k and l, respectively. xki denotes the

i-th object in cluster k and ylj the j-th object in cluster l. This method tends to produce

elongated clusters which is known as chaining effect.

Complete linkage is the opposite of single linkage in that it defines the distance between

any two clusters as the maximum distance between them.

d(k, l) = max(dist(xki, ylj)); k ∈ (1, ..., ni), l ∈ (1, ..., nj), (3.2)

where ni, nj, xki and ylj have the same meaning as in the single linkage method. In

comparison to single linkage, the complete method tends to form tightly bound clusters

[151].

Average linkage, also known as UPGMA (Unweighted Pair-Group Method using Arith-

metic averages) takes the mean distance between all possible pairs of entities of the two

clusters in question. Therefore, it is computationally more expensive than the previous

methods.

d(k, l) =1

nkns

nk∑

i=1

nl∑

j=1

dist(xki, ylj) (3.3)

Centroid linkage, sometimes called Unweighted Pair-Group Method using Centroids

Page 50: Structure Features for Content-Based Image Retrieval and

32 Clustering

161817 3 15 5 1 2 8 292324 4 9 1428 7 26192730201113 6 21121025220

2000

4000

6000

8000

10000

12000

14000

16000

Single Linkage Method

Object Number

Dis

tanc

e

2 4 8 6 1 5 121017 9 3 131621 7 2628291130191524222314201827250

0.5

1

1.5

2

2.5

x 105 Ward Linkage Method

Object Number

Dis

tanc

e

1719 5 6 1124 1 18 2 8 212628 3 4 13 7 2214301027202916 9 151225230

1

2

3

4

5

6

7

8

9

10x 10

4 Complete Linkage Method

Object Number

Dis

tanc

e

1 18 2 4 2126 3 5 1719 6 8 1125 7 163013102822202915 9 23141227240

1

2

3

4

5

6

7

x 104 Average Linkage Method

Object Number

Dis

tanc

e

1 18 2 4 2126 3 5 1719 6 8 1125 7 163013102822202915 9 23141227240

1

2

3

4

5

6

7

x 104 Centroid Linkage Method

Object Number

Dis

tanc

e

1719 6 8 1125 7 163013 1 18 2 3 2126 4 5 9 23141028222029151227240

1

2

3

4

5

6

x 104 Median Linkage Method

Object Number

Dis

tanc

e

Figure 3.2: Various dendrograms obtained from six different linkage methods.

Page 51: Structure Features for Content-Based Image Retrieval and

3.2 Linkage 33

(UPGMC) takes distances between the centroids of two groups.

d(k, l) = dist(xkyl), with (3.4)

xk =1

nk

nk∑

i=1

xki, (3.5)

xl =1

nl

nl∑

j=1

xlj . (3.6)

Ward’s linkage [145] attempts to minimize the sum of squares of two clusters that are

formed at each step and usually tends to create clusters of small size.

d(k, l) = nknld(k, l)2

(nk + nl), (3.7)

where d(k, l)2 is the centroid distance between cluster k and l, as defined in the centroid

linkage.

The last linkage method we consider is the so-called median linkage that uses the

distance between weighted centroids of two clusters.

d(k, l) = dist(xkyl), with (3.8)

xk =1

2(xp + xq), (3.9)

where xk was formed by combining clusters p and q. xl is similarly defined.

In Figure 3.2 we can see the six linkage methods applied to a hierarchical clustering

problem, i.e. the clustering of a set of straight line segments obtained from a color image.

The different dendrograms show the final hierarchy, where for the case of single linkage the

chaining effect can be easily observed.

In the sequel, we will present measures for the validation of the obtained clustering.

Therefore, we will firstly introduce the cophenetic matrix, that is a tool for cluster valida-

tion. Secondly, we will evaluate the clustering result for every linkage method in order to

determine the proper one. Thirdly, we will discuss how to cut or prune a dendrogram in

order to obtain the final clustering and we will introduce a pruning method that is based

on a subgraph distance ratios.

3.2.1 The Cophenetic Matrix

An important quantity of hierarchical clustering algorithms is the cophenetic matrix. Based

on a clustering hierarchy diagram we can define the cophenetic matrix, which consists of a

Page 52: Structure Features for Content-Based Image Retrieval and

34 Clustering

set of cophenetic distances dc defined as in the following. First, assume that Tpijrepresents

the clustering where xi and xj are for the first time merged in the same cluster. Further,

Lpijis the proximity level where the clustering Tpij

has been formed. Then the distance dc

can be written as

dc(xi,xj) = Lpij. (3.10)

The cophenetic matrix is subsequently defined as:

Dc = dc(xi,xj), (3.11)

where i and j = 1, ..., N, with N being the number of elements of x. The cophenetic

matrix fulfills the properties of a metric, such that the following conditions are met:

Dc(i, j) ≥ 0, i 6= j Non-negativity (3.12)

Dc(i, j) = 0, i = j Zero selfdistance (3.13)

Dc(i, j) = Dc(j, i) Symmetry (3.14)

Dc(i, j) ≤ Dc(i, k) + Dc(j, k), ∀ i, j, k Triangle inequality. (3.15)

The first condition holds clearly since there are no negative distances between clusters

possible. The selfdistance must be zero, since an element can be found in the same cluster

with itself at the zero level clustering. The symmetry condition obviously holds, too. In

fact, even the ultrametric inequality holds, which is a stronger condition than the triangle

inequality. The ultrametric inequality states in this case that for every triplet of distances,

the two largest distances out of the three possible ones are equal [46].

Dc(i, j) ≤ max[Dc(i, k),Dc(j, k)] ∀ i, j, k. (3.16)

The cophenetic matrix Dc(i, j) is a special case of a dissimilarity matrix [136].

3.3 Cluster Validity

Cluster validity quantitatively evaluates the results of a clustering algorithm. In the liter-

ature several methods have been proposed that are reviewed in [45] and [39].

Most clustering algorithms impose a kind of cluster structure on a data set X, that is a

priori not evident. Moreover, different clustering techniques will create different clusters,

which might be correct or not. Thus, before we think of a quantity to measure the quality

of the obtained clustering, we need to answer a question that naturally arises; ”Is there a

natural structure in our data at all?”.

Page 53: Structure Features for Content-Based Image Retrieval and

3.3 Cluster Validity 35

3.3.1 Cluster Tendency

Methods for identifying the presence of a clustering structure are called clustering ten-

dency. The basic approach is to test points of a dataset for randomness. Various test for

randomness in datasets have been suggested in the literature, such as [7] who proposed a

Poisson model where objects are represented as uniformly distributed points in a region

R of the l-dimensional data space. In [104] the authors compared the classical Hopkins

test [51] to an approach based on fractal dimension theory (FDT) and fuzzy approximate

reasoning (FAR) to analyze the clustering tendency. FDT investigates the real dimension

of a dataset and FAR is a procedure that deduces (imprecise) conclusions from fuzzy rules.

Their results showed that the Hopkins test is able to robustly determine the clustering ten-

dency in an unknown dataset. Hence, due to its outstanding and thoroughly investigated

performance we decided for the Hopkins test in order to determine the cluster tendency of

our data.

Hopkins Test In the following we will shortly review the Hopkins test which is based

on the nearest neighbor distance, that are the distances between randomly sampled points

and points from the actual distribution. In detail, let X = xi, i = 1, ..., N, where N

is the number of elements of X. Further let Xs ⊂ X with M randomly selected vectors

from X, defined as X = xi, i = 1, ..., M, M ≈ N10

. Also let Xr = yi, i = 1, ..., Mbe a set of vectors randomly distributed according to the uniform distribution. Now we

define dj as the distance from yj ∈ Xr to its nearest vector in xj ∈ Xs. Moreover let d′j

be the distance from xj to the nearest vector in Xs − xj. Now we can write down the

complete Hopkins statistics with the lth powers of dj and d′j accordingly to [59] and [136]

as:

h =

∑Mj=1 dl

j∑M

j=1 dlj +

∑Mj=1 d′l

j

, (3.17)

with 0 ≤ h ≤ 1. In words, the Hopkins statistic index for clustering tendency exam-

ines whether objects in a dataset differ significantly from the assumption that they are

uniformly distributed in the multidimensional space. The statistical test compares the dis-

tances between the real data and their nearest neighbors. In case the data are uniformly

distributed, dj and d′j will be similar resulting in a Hopkins statistic of h=0.5. On the other

hand if clusters are present, the distances for the artificial random data will be larger than

for the real ones, because the random objects are evenly distributed and the real ones are

grouped together. In this case the Hopkins statistic will result in values larger than 0.5.

In our experiment we intend to prove the assumption that we can reject the null hy-

pothesis of a randomly or regularly distributed feature space. In fact, randomness of a

data distribution indicates that an alternative method should be incorporated for the data

analysis. Therefore, we create a random data set according to the uniform distribution,

Page 54: Structure Features for Content-Based Image Retrieval and

36 Clustering

that is used to be tested against our data. The actual data to be tested are the coordinates

of all line segments for each image obtained from the Caltech database (see Section 5.1.3).

Thus, for every database image we evaluate the Hopkins test, resulting in hi values, with

i = 1, 2, ..., 2662. Finally, we average over all hi ending up at a Hopkins test of h=0.81,

that indicates a strong clustering structure, such that we can reject the null (randomness)

hypothesis. Hence, we can safely assume a clustering tendency in our data. A strong

verification for our conclusion is given in the work of [73] where the authors introduced

a modified Hopkins statistics in order to get a measure how much the data are clustered.

Their findings clearly showed that a Hopkins h value of 0.75 or higher is an evidence for a

clustering tendency at the 90% confidence level.

Now, that we have proven the adequateness of clustering techniques for our data we

can proceed with the description of our clustering strategy. Next, we will present the

hierarchical clustering method we have successfully applied to our data. Subsequently, we

are going to motivate the clustering method used, as well as the choice of the best suited

linkage algorithm and the final dendrogram cutting and pruning technique, respectively.

Again, we will provide a proof of the final clustering quality by the cophenetic correlation

coefficient introduced in Section 3.18.

3.3.2 Clustering Validation

An important issue is the validation of a clustering result that is, for hierarchical clus-

tering algorithms, represented as a dendrogram. In Section 3.2 we have shown how to

represent a hierarchical clustering structure by a cophenetic matrix. Now, we can define

a coefficient that measures the degree of similarity between the cophenetic matrix Dc and

the dissimilarity matrix D obtained from the dataset X. As stated in Paragraph 3.2, the

cophenetic and dissimilarity matrices are symmetric, i.e. their diagonals are zero. Hence,

we only need to take the upper triangle matrix of every matrix into consideration, with

O ≡ N(N − 1)/2 elements. Then the cophenetic correlation coefficient (CCC) is defined

as a follows:

CCC =1O

∑N−1i=1

∑Nj=i+1 dc

ijdij − µDcµD

(

1O

∑N−1i=1

∑Nj=i+1 dc2

ij − µ2Dc

)(

1O

∑N−1i=1

∑Nj=i+1 d2

ij − µ2D

)

, (3.18)

where µDcand µD are the respective means, that are, µDc

= 1O

∑N−1i=1 dc

ij and µD =1O

∑N−1i=1 dij . The values of the CCC are in the range of [−1, 1], where values closer to

1 indicate a better agreement between the cophenetic and the proximity matrix. Hence,

the CCC is a measure of how accurately the hierarchical tree represents the dissimilarities

of the original input data.

Page 55: Structure Features for Content-Based Image Retrieval and

3.3 Cluster Validity 37

Table 3.1: Averaged cophenetic correlation coefficients for six linkage methods.

Data pattern Linkage Method Average CCCAngles Single 0.7350

Centroid 0.7808Ward 0.7415Average 0.7808Complete 0.7228Median 0.8035

Lengths Single 0.8632Centroid 0.9262Ward 0.7208Average 0.9262Complete 0.8907Median 0.8920

In order to verify the quality of our clustering we check to which extent it fits the actual

data. Thus, we compute the cophenetic correlation coefficient for the clustering result

obtained with the AHC under the usage of six linkage algorithm, defined in Section 3.2.

For this experiment we have taken six different clusterings obtained for every image of

the Caltech database presented in Section 5.1.3. In detail, we take an imagei with i =

1, 2, ...2662 and extract a set of line segments that is the ground truth data of every

image. Then, we compute the Euclidean distance matrix (EDM, see Chapter 4.1), where

we use the angles between any segment and the ordinate2. In addition, we compute an EDM

from the relative segment lengths. The relative length of each segment is computed as the

fraction of the longest possible segment of an image, its diagonal and computed according

to Equation 4.9 Once the EDMs are computed the AHC can construct all dendrograms

under the usage of different linkage methods (see Section 3.2).

In Table 3.1 the cophenetic correlation coefficient is printed for various linkage meth-

ods 3. We can see that most of the CCC values are close to one, indicating a high quality

clustering. The average linkage method gives the best result for the length data and the

median linkage the best for the angle data. The centroid method is of similar quality.

The single linkage gives rather poor results in comparison to the others. In [9] the author

confirms a similar observation. Now, when we have determined the quality of the various

linkage methods it is time to think about obtaining the final clusters. That is done by

cutting or pruning the hierarchy.

2Note, that during the actual EDM computation only differences are considered. That makes the resultinvariant against similarity transformations

3Note, that CCC is averaged over all images from the Caltech database.

Page 56: Structure Features for Content-Based Image Retrieval and

38 Clustering

3.3.3 Cutting a Dendrogram

The result of the hierarchical clustering is presented in the form of a dendrogram, as

shown in Figure 3.2. In order to determine the final clusters we have to cut or partition

the hierarchy. Two groups of approaches exist, where the first one is the cutting of the

dendrogram at a given height, that is, the distance between the nodes in the graph [31]. The

second method is to prune the dendrogram by a manual or automatic selection of clusters at

various distances, such as applied in [90] or [132], where a nearest neighbor purity estimator

was introduced in order to determine the pruning point. The authors in [90] concluded

that their proposed method tends to create small clusters. In our experiments we explored

a different approach. We propose an automatic partitioning algorithm that relies on the

distances between nodes taken from a dendrogram. Note, that a dendrogram is a binary

rooted tree, in which every node has either two children (child nodes to the left and to the

right) or is a leaf.

We define the subsequent node distance SDnodei,j with i = 1, 2, ..., Nnode, and j =

i+1l, i+1r where Nnode is the number of nodes in the dendrogram, with il and ir defining

the subsequent (child) nodes to the left and to the right, respectively. Specifically, SDnodei,j

is the dendrogram’s ordinate value between any two subsequent nodes within a subtree.

Then, we define a distance ratio between nodes in any subtree as:

DRsubm =

argminj=i+1l,i+1r

(SDnodei,j )

argmaxk=i+2l,i+2r

(SDnodej,k )

, (3.19)

where m = 1, 2, ..., M − 1, with M as the number of subtree nodes. The pruning of the

dendrogram is made at DRsubm ≤ 1

4, where 1

4turned out to work best with respect to well

separated clusters. An example of a dendrogram pruning result is shown in Figure 3.3.

As discussed in the begin of this chapter, we are interested in grouping straight line

segments. Now, we want to select the final clusters that will be used for the further feature

computations and the subsequent content-based image retrieval and classification tasks.

From the final set of clusters4 we will only select the most compact ones for further con-

siderations, since all members are more similar to each other than to any other subgroup;

such clusters show a high mutual similarity. As the measure of compactness (intra-cluster

distance) we use

σj =

(

1

nc

x∈Cj

||x − wj||2)

1

2

, (3.20)

where Cj denotes an individual cluster with data members x and wj is the j-th cluster

representative. Finally, we select all clusters with a σj ≤ stdσ, where stdσ is the standard

4Note, that the number of actual cluster may vary from image to image.

Page 57: Structure Features for Content-Based Image Retrieval and

3.3 Cluster Validity 39

Figure 3.3: The two graphs show dendrograms where the ellipse in the left one enclosesa typical subgraph. SD12 and SD23 are the distances between nodes in the enclosedsubgraph. The right dendrogram pictures the actual pruning for the segments of thetemple image (see Figure 3.4).

Page 58: Structure Features for Content-Based Image Retrieval and

40 Clustering

deviation of all σj .

Figure 3.4 and Figure 3.3 illustrate the last two steps: the pruning of the dendrogram

and the final selection of clusters based on the compactness measure σj . The actual clus-

tering was performed with the average linkage method that incorporates the combination

of segment angle and segment length data patterns, as previously described.

The first row in Figure 3.4 shows a color image and its straight line segments. The

second row shows two sets of line segments, where the left image represents members of

less compact clusters and the right image denotes clusters of high compactness, according

to Equation 3.20. The left set of segments can be interpreted as less important and, thus,

will be ignored for the structure-based feature computation presented in Chapter 4. Similar

results are obtained for other images. As a final remark it has to be said that it is not

possible to precisely determing all salient line segments of an image. However, the salient

segments clearly show a tendency of being more visually more significant then the discarded

ones. The further results presented in this thesis support this observation.

3.4 Summary and Conclusion

In this chapter we have described the important unsupervised learning problem of ag-

glomerative hierarchical clustering, that was applied to the task of grouping straight line

segments. Firstly, we have described the algorithm and six linkage methods. Secondly,

we have proven the a priori assumption of an underlying clustering structure of our data

with the Hopkins test. The result of h=0.81 indicates a strongly structured dataset that

supports the usage of clustering methods for the grouping of line segments. Thirdly, we

have evaluated the quality of 159725 hierarchies obtained from line segment patterns of the

Caltech database. The cophenetic correlation coefficient was used as a quality measure.

With an averaged CCC score of 0.9262 the average linkage method produces the best re-

sult for our dataset. Fourthly, we have presented a subgraph distance ratio method that is

used for pruning or cutting the dendrogram, respectively. The results show that the ratio

is well suited for measuring the distance between clusters. Finally, we base the selection

of meaningful or salient clusters on the intra-class compactness, resulting in a subset of

straight line segments, that is used for further computations.

5Six times the number of images (2662) from the Caltech database.

Page 59: Structure Features for Content-Based Image Retrieval and

3.4 Summary and Conclusion 41

0 50 100 150 200 250 300 350 400

0

50

100

150

200

250

0 50 100 150 200 250 300 350 400

0

50

100

150

200

2500 50 100 150 200 250 300 350 400

0

50

100

150

200

250

Figure 3.4: The first row shows a color image and all line segments. The second row presentsline segment clusters, where the left graph illustrates noise-like segments (non-compactclusters). Whereas the right graph shows a salient subset of line segments according to thecompactness measure (see Equation 3.20).

Page 60: Structure Features for Content-Based Image Retrieval and

42 Clustering

Page 61: Structure Features for Content-Based Image Retrieval and

Chapter 4

Structure-Based Features

The structure of an image is a fundamental source of salient information. In fact, it is

long known that a certain perceptual organization plays a fundamental role in the human

early vision system. This law of organization dates back to the Gestalt psychologists

[66] who defined various heuristics such as for example proximity, symmetry, similarity or

continuation. The work of [93] incorporated perceptual grouping for computer vision tasks.

Specifically, the authors concentrated on geometrical relationships in images, where tests

were run on about 20 images.

In [30] a method is presented that groups line segments in perceptually salient contours.

The algorithm results in a hierarchical representation of the contour. According to the

authors the method is robust against texture, clutter and repetitive image structures. The

results look promising, however, the experimental analysis is restricted to a few images only.

This method can not be easily applied to CBIR, since objects in an image rarely have a clear

contour. Complex and cluttered scenes are rather the normal cases. More advancements

in the area of perceptual grouping can be found in [58] described in Chapter 1.3.

In this chapter we derive features that are computed from the geometric structure of

images. In the previous chapters we have shown how to robustly compute edge maps

and form line segments. A hierarchical clustering algorithm was used in order to select a

salient subset of line segments, that will be used for the following computations. As we

have motivated in Section 1.3, structure is of importance on several scales. Therefore, we

compute a hierarchy of structural features, namely global and local ones. The former ones

depict a holistic scene representation and the latter ones take local perceptual groups and

their connectivity into account.

In the sequel, we present Euclidean distance matrices, the extraction of global features

and the definition of local perceptual groups.

Page 62: Structure Features for Content-Based Image Retrieval and

44 Structure-Based Features

4.1 Euclidean Distance Matrix

An Euclidean Distance Matrix (EDM) is a two-dimensional array consisting of distances

taken from a set of entities, that can be coordinates or points from a feature space. Thus,

and EDM incorporates distance knowledge between points.

Revealing information solely obtained from point distances is of major importance in

various areas of research such as genetics, pattern recognition, molecular configurations,

protein folding, geodesy, economics, statistics, psychology, chemistry, engineering and many

others.

A set of points in an Euclidean space is characterized by their coordinates. Specifically,

an Euclidean Distance Matrix is a real valued n × n matrix E containing the squared

distances of pairs of points from a table of n points (Xk, Yk), with k = 1, 2, . . . n. An

EDM is defined as

E =

e11 e12 · · · e1n

e21 e22 · · · e2n

......

. . ....

en1 en2 · · · enn

,

where

eij =√

(Xi − Xj)2 + (Yi − Yj)2, with i,j =1,2,. . . ,n, (4.1)

describes the Euclidean distance between the feature points i and j.

An EDM E inherits the following properties from the norm ‖ · ‖2:

1. eij ≥ 0, i 6= j non-negativity

2. eij = 0, i = 0 self-distance

3. eij = eji symmetry

4. eij ≤ eik + ekj, i 6= k 6= j triangle inequality

The dimension of E is n2, but as E is symmetric with zeros along the diagonal, there

are only n(n−1)2

unique elements.

Figure 4.1 shows a set of five points and their corresponding EDM distances. In addi-

tion, we can see the same set of points under rotation, translation and scaling, respectively.

As we will show, an EDM remains constant under similarity transformations.

A common translation of all points will not effect an EDM E since the change of the

point co-ordinates is nullified, as shown in Equation 4.1. Assume we want to shift all points

Page 63: Structure Features for Content-Based Image Retrieval and

4.1 Euclidean Distance Matrix 45

P4

P1

e12

P2

P5

e13P3

e14

e15

(a) A set of five points.P5

P4P2

P1P3

e13

e12

e15

e14

(b) Translated set of points.

P1

e14

e13

e15

P5

e12

P2

P3

P4

(c) Rotated set of points.

P1

P2

e13

P4e15

e12

P5

e14

P3

(d) Scaled set of points.

Figure 4.1: A graphical illustration of a set of points and three corresponding transformations(translation, rotation and scaling). Note, that we only show the EDM elements (e12 to e15) for thepoint P1.

Page 64: Structure Features for Content-Based Image Retrieval and

46 Structure-Based Features

P = pk; pk ∈ Rn, k = 1, 2, . . . , n, by the translation vector1 t = (t1, t2)T , t1, t2 ∈ R,

that is in 2-dimensions (Xk, Yk) 7→ (Xk + t1, Yk + t2). Such that the resulting EDM can be

written as

E(Xk, Yk) = E(Xk + t1, Yk + t2). (4.2)

In general, an Euclidean distance matrix remains unaltered in case of an arbitrary trans-

lation2 t ∈ Rn.

Similarly, an EDM is invariant against rotation. Let us consider a rotation in R2, with

Rθ being a rotation matrix3 defined as

Rθ =

[

cos(θ) sinθ

−sinθ cos(θ)

]

. (4.3)

Note, that a rotation matrix is always orthogonal, such that RTθ = R−1

θ and RθRTθ =

RTθ Rθ = 1. So, any rotation matrix will fulfill the following property

XTRTθ RθX = XTX = 1, (4.4)

with X = (Xk, Yk)T . Recall the definition of an EDM (4.1), then together with Equation 4.4

we arrive at

E(RθX) = E(X). (4.5)

In words, Equation 4.5 states that an EDM E remains unchanged in case of rotating the

underlying point set.

Finally, an EDM is invariant against scaling if the matrix is normalized in the range of

[0, 1], otherwise it is scale invariant up to a factor S. Next, we want to show a simple toy

example that shows the invariance properties of an EDM.

Example

Let us show the invariance properties of an EDM by considering the following set of points

X =

3 8

1 2

12 9

6 9

4 17

,

1For the two dimensional case. Note, that in general t can be as well of a higher dimension.2Note, if t is known in advance then it is possible to remove the translation by: E((Xk, Yk)T − t

T ) =E(Xk, Yk).

3The determinant of Rθ is equal to 1. det(R) =

cos(θ) sinθ

−sinθ cos(θ)

= 1.

Page 65: Structure Features for Content-Based Image Retrieval and

4.2 Feature Computation 47

then it’s EDM can be written as:

E =

0 6.3246 9.0554 3.1623 9.0554

6.3246 0 13.0384 8.6023 15.2971

9.0554 13.0384 0 6.0000 11.3137

3.1623 8.6023 6.0000 0 8.2462

9.0554 15.2971 11.3137 8.2462 0

.

Next, consider the case of a simple common translation in X,

E(X_k, Y_k) = E(X_k + t_1, Y_k), (4.6)

with t_1 = 10. In addition, we apply the following rotation to the translated point set:

R_θ = [  cos(π/3)   sin(π/3)
        −sin(π/3)   cos(π/3) ].  (4.7)

Then the resulting EDM can be written as follows:

E_transrot = (       0   6.3246   9.0554   3.1623   9.0554
                6.3246        0  13.0384   8.6023  15.2971
                9.0554  13.0384        0   6.0000  11.3137
                3.1623   8.6023   6.0000        0   8.2462
                9.0554  15.2971  11.3137   8.2462        0 ).

E_transrot denotes the EDM of the rotated and translated point set. We verify the invariance property by subtracting the two EDMs: the largest absolute difference of E_transrot − E is 0.44409 · 10^{−16}. As we can see, the two EDMs are identical up to numerical precision.
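This numerical check can also be reproduced in a few lines. The following sketch (Python/NumPy; the function name edm and the construction via broadcasting are our own illustration, not part of the thesis implementation) builds the EDM of the point set above and verifies that translation and rotation leave it unchanged up to floating-point precision.

```python
import numpy as np

def edm(points):
    """Euclidean distance matrix of an (n x 2) point set."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[3, 8], [1, 2], [12, 9], [6, 9], [4, 17]], dtype=float)

t = np.array([10.0, 0.0])                       # translation of Equation 4.6
theta = np.pi / 3                               # rotation angle of Equation 4.7
R = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

X_transrot = (X + t) @ R.T                      # translate, then rotate every point

print(np.round(edm(X), 4))                      # reproduces the EDM given above
print(np.abs(edm(X_transrot) - edm(X)).max())   # of the order 1e-16, i.e. invariant
```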

4.2 Feature Computation

Euclidean distance matrices are based on the relations of Euclidean point coordinates. In

an image, many of these points represent noise and are not meaningful enough for an

accurate image representation. In order to overcome this problem, we base EDMs on

the information of meaningful geometric primitives such as line segments (see Chapter 3).

Note, that a line segment is more discriminative than just a single point, since it features an

orientation and a length. Subsequently, we will derive global and local structural features

of line segments.


4.2.1 Global Features

The global features we define in this section are able to capture a holistic representation

for the structural content of an image. Euclidean distance matrices help to encode angular

and spatial relations of line segments. In the end, the global feature vectors are represented

by histograms.

Let L = {l_i | i = 1, 2, ..., N} be a set of line segments obtained from one image according to the method presented in Chapter 3. Each line segment l_i is described by two points in the Euclidean plane, l_i = (p_i^b, p_i^e), where p_i^b and p_i^e are the begin and end points of the segment, with coordinates (x_1, x_2) ∈ R². Then we can compute geometric properties of L such as the angles of all segments between each other, the relative length of every segment and the relative Euclidean distance between all segment mid-points. In detail, the angle between two segments l_i and l_j is defined as:

cos(θ_ij) = (l_i · l_j) / (‖l_i‖_2 ‖l_j‖_2), (4.8)

with ‖ · ‖_2 being the L2-norm. The angle is in the range of [−π, π]. The relative length

of a segment l_i is the distance from p_i^b to p_i^e and can be written as:

len(l_i) = √((x_i^e − x_i^b)² + (y_i^e − y_i^b)²) / √((x_max − x_0)² + (y_max − y_0)²), (4.9)

where x_i^b, x_i^e, y_i^b and y_i^e denote the coordinates of the segment's begin and end points. The denominator is a scaling factor with respect to the longest possible line segment, which is as long as the diagonal of the image, with (x_0, y_0) and (x_max, y_max) as its begin and end point coordinates.

The Euclidean distance between the mid-points p_i^c and p_j^c of the segments l_i and l_j is defined as

dist_c(l_i, l_j) = √((x_j^c − x_i^c)² + (y_j^c − y_i^c)²) / √((x_max − x_0)² + (y_max − y_0)²), (4.10)

with x_i^c, x_j^c, y_i^c and y_j^c as the coordinates of the segment mid-points. The denominator

fulfills the same scaling purpose as the one in Equation 4.9. Thus, the relative length of a

segment and the relative distance between two segments is limited to the range [0, 1]. The

relative representation ensures invariance under isotropic scaling. Figure 4.2 illustrates the

geometric properties for a set of line segments.

Now that the three basic properties of a set of line segments are computed, we can incorporate this information into Euclidean distance matrices. Thus, the EDMs will represent the relative geometric connectivity of a set of straight line segments.



Figure 4.2: Illustration of line segment properties: (a) angles between a set of three segments and (b) distances between the mid-points of a set of three segments.

Specifically, we define three EDMs: one based on segment angles, E^ang (see Equation 4.8), a second one based on relative segment lengths, E^len (see Equation 4.9), and a third one based on relative distances between segments, E^dist (see Equation 4.10). The matrix E^ang can be written as:

E^ang = ( e^ang_11  e^ang_12  ···  e^ang_1n
          e^ang_21  e^ang_22  ···  e^ang_2n
             ⋮         ⋮       ⋱      ⋮
          e^ang_n1  e^ang_n2  ···  e^ang_nn ), (4.11)

and is computed according to

e^ang_ij = ‖θ_i − θ_j‖, (4.12)

where the values of θ_i and θ_j are in the range of [−π, π]. The angles are taken between the line segments i and j. Similarly, E^len is given by:

E^len = ( e^len_11  e^len_12  ···  e^len_1n
          e^len_21  e^len_22  ···  e^len_2n
             ⋮         ⋮       ⋱      ⋮
          e^len_n1  e^len_n2  ···  e^len_nn ), (4.13)

and calculated by

e^len_ij = [(len(l_i) − len(l_j))^T (len(l_i) − len(l_j))]^{1/2} = ‖len(l_i) − len(l_j)‖, (4.14)

with len(l_i) and len(l_j) being the lengths of the Euclidean line segments i and j, defined in Equation 4.9.


The third EDM, E^dist, which is defined analogously to Equation 4.13, encodes the relative Euclidean distances of all line segment mid-points:

e^dist_ij = [(p_i^c − p_j^c)^T (p_i^c − p_j^c)]^{1/2} = ‖p_i^c − p_j^c‖, (4.15)

where p_i^c and p_j^c represent the mid-points of segments i and j.

Next, we form histograms for the three EDMs. Since every EDM is symmetric (see Section 4.1), we can extract the upper triangular part, i.e. the elements e^utri_ij with i < j, such that we obtain three upper triangle matrices: E^ang,utri_ij, E^len,utri_ij and E^dist,utri_ij. Then we create three histograms with different resolutions,

H^ang = {h^ang_{b_a}}, b_a = 1, 2, ..., B_a, (4.16)

H^len = {h^len_{b_l}}, b_l = 1, 2, ..., B_l, (4.17)

H^dist = {h^dist_{b_d}}, b_d = 1, 2, ..., B_d, (4.18)

where B_a, B_l and B_d denote the number of bins of the three histograms. The b-th bin h_b of every histogram is defined as:

h_b = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} { 1 if the element e_ij is quantized into the b-th bin; 0 otherwise }, (4.19)

where e_ij is used as a synonym for an element of any of the three upper triangle EDMs E^ang,utri_ij, E^len,utri_ij or E^dist,utri_ij. The three histograms can be understood as a holistic geometric representation of a set of segments. The histogram features are invariant against similarity transformations and changes in illumination (as shown in Section 5.5.2).

In our experiments we have empirically determined the best resolution for the histograms. For H^ang it turned out that 72 or 36 bins, which correspond to a 5° or 10° resolution with respect to angles, produced the best results. The resolution for H^len and H^dist depends more on the application data than that of H^ang. However, we have found that 15 bins result in a robust and compact histogram feature. In the sequel of this thesis we denote the global structure features as H^global = {H^ang, H^len, H^dist}. For the experiments described in Chapter 5 we use a resolution of 36 bins for H^ang and 15 bins for H^len and H^dist, if not indicated otherwise.
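For illustration, the global feature computation can be sketched as follows (Python/NumPy). The segment representation as (x_b, y_b, x_e, y_e) rows, the helper name and the use of segment orientations for θ_i are our own illustrative assumptions, not the thesis implementation; the sketch merely mirrors Equations 4.9, 4.10, 4.12, 4.14, 4.15 and 4.19.

```python
import numpy as np

def global_structure_features(segments, diag, n_ang=36, n_len=15, n_dist=15):
    """segments: array of shape (N, 4) with (xb, yb, xe, ye) per segment.
    diag: length of the image diagonal (the longest possible segment)."""
    p_b, p_e = segments[:, :2], segments[:, 2:]
    d = p_e - p_b
    theta = np.arctan2(d[:, 1], d[:, 0])            # segment orientation in [-pi, pi]
    length = np.linalg.norm(d, axis=1) / diag        # relative length (Eq. 4.9)
    mid = (p_b + p_e) / 2.0 / diag                   # scaled mid-points (Eq. 4.10)

    # EDMs over angles, relative lengths and mid-point distances (Eqs. 4.12-4.15)
    e_ang = np.abs(theta[:, None] - theta[None, :])
    e_len = np.abs(length[:, None] - length[None, :])
    e_dist = np.linalg.norm(mid[:, None, :] - mid[None, :, :], axis=-1)

    iu = np.triu_indices(len(segments), k=1)         # unique upper-triangle elements
    n2 = float(len(segments)) ** 2                   # normalization as in Eq. 4.19
    h_ang = np.histogram(e_ang[iu], bins=n_ang, range=(0, 2 * np.pi))[0] / n2
    h_len = np.histogram(e_len[iu], bins=n_len, range=(0, 1))[0] / n2
    h_dist = np.histogram(e_dist[iu], bins=n_dist, range=(0, 1))[0] / n2
    return np.concatenate([h_ang, h_len, h_dist])    # H_global as one vector
```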

4.2.2 Local Perceptual Features

The global features described in Section 4.2.1 encode a complete scene. The advantage of the global approach is its general applicability. However, for certain tasks, such as object recognition, local features play an important role, too. Man-made



objects such as buildings or cars exhibit many perpendicular and/or parallel line segments. Therefore, we will introduce local features based on perceptual groups of line segments. First, we define perceptual groups as unique, prominent structural entities of line segments with well-defined relations:

• Parallelism

• Perpendicularity

• Diagonality (π/4, 3π/4)

These groups are formed according to angular relations between segments and will be used to compute geometric relations between their members. Subsequently, each group will be divided into subsets that are formed according to spatial relations between their members. In the sequel, we will show that this hierarchical approach of grouping segments according to various geometric, angular and spatial relations yields a highly discriminative local feature, as shown in Chapter 5.

Now we form the groups listed above. Specifically, let L = {l_i | i = 1, 2, ..., N} be the set of line segments extracted from an image according to Section 4.2.1. Then we obtain several subsets of segments that fulfill specific relations. The first relation is parallelism. Let L^‖ = {l_k^‖ | k = 1, 2, ..., N^‖} be a set of N^‖ parallel line segments. Two segments l_i and l_j are parallel if they fulfill the following constraint:

(π − ε_‖) ≤ |θ_ij(l_i, l_j)| ≤ π  for i ≠ j, (4.20)

0 ≤ |θ_ij(l_i, l_j)| ≤ ε_‖  for i ≠ j, (4.21)

with ε_‖ ≤ 5°. The parameter ε_‖ accounts for robustness and quantization errors occurring in the actual line segment formation process (see Section 2.4).

The second subset of segments is formed by a set of perpendicular segments L^⊥ = {l_l^⊥ | l = 1, 2, ..., N^⊥} that follow:

(π/2 − ε_⊥) ≤ |θ_ij(l_i, l_j)| ≤ (π/2 + ε_⊥)  for i ≠ j, (4.22)

with ε_⊥ ≤ 5°. ε_⊥ plays a similar role as ε_‖ in Equation 4.20.

The third subset contains diagonal (π/4) segments L^diag45 = {l_m^diag45 | m = 1, 2, ..., N^diag45} under the condition:

(π/4 − ε_diag45) ≤ |θ_ij(l_i, l_j)| ≤ (π/4 + ε_diag45)  for i ≠ j, (4.23)

with ε_diag45 ≤ 5°.


The fourth subset is created with diagonal (3π/4) segments L^diag135 = {l_n^diag135 | n = 1, 2, ..., N^diag135} following:

(3π/4 − ε_diag135) ≤ |θ_ij(l_i, l_j)| ≤ (3π/4 + ε_diag135)  for i ≠ j, (4.24)

with ε_diag135 ≤ 5°.
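The angular grouping of Equations 4.20 to 4.24 amounts to a few comparisons per segment pair. The sketch below (Python; an illustration under our own assumptions, with a tolerance of 5° and pairwise angles folded into [0, π]) collects, for each of the four groups, the indices of the segments involved.

```python
import numpy as np

def angular_groups(theta, eps_deg=5.0):
    """theta: array of segment orientations in radians, range [-pi, pi].
    Returns index sets for the parallel, perpendicular and diagonal groups."""
    eps = np.deg2rad(eps_deg)
    n = len(theta)
    groups = {'par': set(), 'perp': set(), 'diag45': set(), 'diag135': set()}
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(theta[i] - theta[j])
            a = min(a, 2 * np.pi - a)                 # fold |theta_ij| into [0, pi]
            if a <= eps or a >= np.pi - eps:          # Eqs. 4.20/4.21
                groups['par'].update((i, j))
            if abs(a - np.pi / 2) <= eps:             # Eq. 4.22
                groups['perp'].update((i, j))
            if abs(a - np.pi / 4) <= eps:             # Eq. 4.23
                groups['diag45'].update((i, j))
            if abs(a - 3 * np.pi / 4) <= eps:         # Eq. 4.24
                groups['diag135'].update((i, j))
    return groups
```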

The four sets of segments L^‖, L^⊥, L^diag45 and L^diag135 account for well-defined structures. The subsets reflect line segments with certain relations. In fact, we can extract similar features as we did in Section 4.2.1. Following that methodology, we can compute three EDMs, E^ang_*, E^len_* and E^dist_*, for each of the four extracted sets of segments. Note that the * is a placeholder for the four sets. Specifically, we define the angles between two segments, the relative segment lengths and the relative distance between two segments according to Equations 4.8, 4.9 and 4.10 for every subset of line segments. The resulting EDMs are defined as follows:

E^ang_* : e^ang*_ij = ‖θ_i − θ_j‖, (4.25)

E^len_* : e^len*_ij = ‖len(l_i) − len(l_j)‖, (4.26)

E^dist_* : e^dist*_ij = ‖p_i^c − p_j^c‖, (4.27)

where θ_i and θ_j are in the range of [−π, π], len(l_i) and len(l_j) are the lengths of the line segments i and j (see Equation 4.9), and p_i^c and p_j^c represent the mid-points of segments i and j. Note that in Equations 4.25, 4.26 and 4.27 the indices i and j differ for every subset L^‖, L^⊥, L^diag45 and L^diag135.

Then we create three histograms for every subset of line segments (denoted by the *), similarly to Equations 4.16 to 4.18:

H^ang_* = {h^ang*_{b_a}}, b_a = 1, 2, ..., B_a, (4.28)

H^len_* = {h^len*_{b_l}}, b_l = 1, 2, ..., B_l, (4.29)

H^dist_* = {h^dist*_{b_d}}, b_d = 1, 2, ..., B_d, (4.30)

where each bin is computed similarly to Equation 4.19. The histograms represent geometric relations of salient segment subsets. Since three histograms have been formed for every set, we obtain 12 histograms in total.

We can recognize that geometric relations play an important role on various scales. Remember that in Section 4.2.1 we solely focused on the global scale. To this point, we have encoded relations of line segment subsets. In the next step we will further reduce the members of the four sets by additional constraints. Specifically, we extract local perceptual groups by forming spatially close sets. Let l_l^⊥ be a straight line from the set


of perpendicular segments, and let len(l_l^⊥) denote the Euclidean length of the l-th segment. Then, we search all segments in the set L^⊥ that fulfill the following criteria:

dist(l_l^⊥, l_o^⊥) ≤ S · len(l_l^⊥), with (4.31)

dist(l_l^⊥, l_o^⊥) = min{ dist(l_l^⊥(p^b), l_o^⊥(p^b)), dist(l_l^⊥(p^b), l_o^⊥(p^e)), dist(l_l^⊥(p^e), l_o^⊥(p^b)), dist(l_l^⊥(p^e), l_o^⊥(p^e)) }, (4.32)

with l, o = 1, 2, ..., N^⊥, and S a scaling parameter. Thus we obtain a subset of perpendicular line segments L^⊥s that are spatially close to each other. In fact, L^⊥s = {L^⊥s_o | o = 1, 2, ..., N^⊥s} may consist of several local sets that fulfill the proximity criterion of Equation 4.31, i.e. there might be clearly separated groups (with several members) of perpendicular line segments. In words, Equation 4.31 searches for all segments that are within a radial distance of S times the length of the l-th segment. The radial distance is computed for all four possible distances between the end points of two segments, and the minimal one is selected. Similarly, we define spatially close subgroups for the other three sets of segments L^‖, L^diag45 and L^diag135 as:

dist(l_k^‖, l_p^‖) ≤ len(l_k^‖), (4.33)

dist(l_m^diag45, l_r^diag45) ≤ len(l_m^diag45), (4.34)

dist(l_n^diag135, l_s^diag135) ≤ len(l_n^diag135), (4.35)

where k, p denote any two segments from L^‖, m, r stand for any two segments from L^diag45 and n, s are segment indices of L^diag135. The distances between segment points are defined as in Equation 4.31. Thus, we arrive at four subsets of local perceptual segment groups: L^⊥s, L^‖s, L^diags45 and L^diags135.
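The proximity criterion of Equations 4.31 and 4.32 can be sketched as follows (Python/NumPy; the helper names and the simple pairwise search are our own illustration, not the thesis implementation). For each segment of a group it collects all other segments of that group whose minimal end-point distance lies within S times its own length.

```python
import numpy as np

def endpoint_distance(s1, s2):
    """Minimal distance between the end points of two segments (Eq. 4.32).
    A segment is given as (xb, yb, xe, ye)."""
    p1 = [s1[:2], s1[2:]]
    p2 = [s2[:2], s2[2:]]
    return min(np.linalg.norm(np.subtract(a, b)) for a in p1 for b in p2)

def spatially_close(segments, group, S=1.0):
    """segments: array of shape (N, 4); group: list of indices into segments
    (e.g. the perpendicular set). Returns, per member, the indices fulfilling Eq. 4.31."""
    close = {}
    for l in group:
        ref_len = np.linalg.norm(segments[l][2:] - segments[l][:2])
        close[l] = [o for o in group if o != l and
                    endpoint_distance(segments[l], segments[o]) <= S * ref_len]
    return close
```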

We show the formation of segment subsets in Figure 4.3, where we can see 12 arbitrary line segments. In order to illustrate the formation of the segment subsets we group perpendicular segments. The subset L^⊥ in Figure 4.3 consists (see Equation 4.22) of the segments l_2, l_5, l_6, l_9, l_10, l_11. Next we consider L^⊥s ⊂ L^⊥. According to Equation 4.31, the segments l_6, l_10 and l_2, l_5 belong to the set L^⊥s.

Each of the four sets L^*s, where the * is a placeholder for any of the sets L^⊥s, L^‖s, L^diags45 or L^diags135, consists of segments that are spatially close. Although the segments are close to each other, their Euclidean lengths may differ significantly. Thus, we select additional perceptual segment groups from L^*s that are of a similar Euclidean length.



Figure 4.3: Sample groups of line segments that follow certain constraints (see Equations 4.20 to 4.24 and Equations 4.31 to 4.35); shown are, e.g., the groups L^⊥s_1(l_6, l_10), L^⊥s_2(l_2, l_5), L^‖s_1(l_7, l_8) and L^diag45_1(l_9, l_12).

Let r^len be a segment length ratio. Then we can define r^len for each of the four sets as:

r^len_⊥ = len(l_l^⊥s) / len(l_o^⊥s), (4.36)

r^len_‖ = len(l_k^‖s) / len(l_p^‖s), (4.37)

r^len_diag45 = len(l_m^diags45) / len(l_r^diags45), (4.38)

r^len_diag135 = len(l_n^diags135) / len(l_s^diags135), (4.39)

with len(l_l^*) ≤ len(l_o^*) and 0 ≤ r^len_* ≤ 1, where * denotes any of the four segment sets. r^len_* serves as a variable threshold for the selection of segment length ratios; e.g. r^len_* = 1 means that only lines of equal length will be selected. Thus, r^len_* is used to select another four sets of segments: L^⊥r, L^‖r, L^diagr45 and L^diagr135.

Now we have obtained two times four sets of perceptual groups of line segments. The first four sets were extracted according to the spatial distance of segments, and the second four sets have been derived on the basis of similarly long segments.


Moreover, we can form a final set L^*rs = L^*r ∩ L^*s, that is, the intersection of the two sets. In words, L^*rs contains all segments that simultaneously fulfill the constraints of Equations 4.31 to 4.35 and Equations 4.36 to 4.39. Thus, this set represents a further local perceptual group that follows certain relations. Now we form EDMs and histograms, as we have already done before. The EDMs and histograms are formed similarly to Equations 4.25 to 4.27 and Equations 4.28 to 4.30. Thus, we obtain three sets of four histograms: H^angrs_*, H^lenrs_* and H^distrs_*.

To this point we have derived a complete hierarchy of sets of segments. We have

started with the grouping of segments with respect to their angles into four main groups.

Subsequently, we have extracted subsets that are restricted to spatially close lines and to

segments of similar lengths and to sets that fulfill both criteria at the same time, respec-

tively. This highly selective process of grouping segments into perceptual entities assigns

a large fraction of the initially computed lines (L) to perceptual groups. Thus, depending

on the size of the original set, some perceptual groups can be very sparse.

At the beginning of this chapter we have mentioned the extraction of local perceptual groups and their connectivity. The extraction of various local perceptual groups is important for the determination of highly salient image locations such as windows or doors. Symmetry and repetitive structures are common for man-made objects; e.g. the facade of a skyscraper exhibits numerous parallel and perpendicular structures (see Figure 2.2). The various local perceptual groups we have defined so far are capable of encoding such symmetries, as shown by the results in Chapter 5.

Next we add connectivity to all perceptual groups. Connectivity adds relations between the local perceptual groups, as shown in Figure 4.4. In the graph we see a set of line segments. Specifically, we can see two sets of local perceptual groups with two members each (L^⊥r and L^‖r). For better readability we only show subsets consisting of two lines, such as L^⊥r_2. Note that subsets can consist of more than two segments if all constraints are met. The dotted connection line linking the two paired segments of a group, such as L^⊥r_2 or L^⊥r_1, can be interpreted as a center line of gravity.

Each set L^*r consists of subsets L^*r_o, with o = 1, 2, ..., N^*r (compare Figure 4.4). Then we take the two closest segments within every subset L^*r_o, where the distance between segments l_l and l_o is computed as in Equation 4.31.

Once the two closest segments of a perceptual group are found, we can repeat that for

all other groups. These paired segments can be understood as line segments that exhibit

highly relevant geometric relations between each other.

Now we introduce connectivity between the local perceptual groups. To this point we have computed relations between single segments of sets that fulfill certain criteria. Now we calculate connectivity between paired groups of segments.


Figure 4.4: Local perceptual groups of line segments based on Equations 4.31 to 4.35 and Equations 4.36 to 4.39; shown are, e.g., the paired groups L^⊥r_1(l_6, l_7), L^⊥r_2(l_4, l_5), L^‖r_1(l_8, l_9) and L^‖r_2(l_2, l_3) with their connection lines d_67, d_45, d_89 and d_23.


Let d_kp denote the shortest distance between any two points of the paired segments k and p, computed according to Equation 4.32. Further, we introduce the mid-point p^c_kp of the connection line d_kp. Specifically, connectivity is defined via Euclidean distance matrices based on the set of points p^c_kp, where the indices k and p denote a pair of two segments. In Figure 4.4 we can see d_kp for paired segments, such as d_45.

In order to control the scale of connectivity we introduce a factor C_*, the number of groups that should be taken into consideration, with * denoting any of the sets of local perceptual groups. The parameter C_* can take any value in the range [1, 2, ..., N^*r], with N^*r being the number of subgroups of L^*r, and can be understood as a granularity factor. A C_* of 1 means no connectivity, since only one pair of segments is taken into consideration, whereas a C_* equal to the number of subgroups N^*r indicates full connectivity. Then, we construct connectivity EDMs according to:

E^angpg_* : e^angpg*_ku = ‖θ_kp − θ_uv‖, (4.40)

E^lenpg_* : e^lenpg*_ku = ‖len(d_kp) − len(d_uv)‖, (4.41)

E^distpg_* : e^distpg*_ku = ‖p^c_kp − p^c_uv‖, (4.42)

with θ_kp and θ_uv in the range of [−π, π] denoting the angles of the connection lines d_kp and d_uv of a pair of segments from one of the four local perceptual groups. These groups were extracted according to Equations 4.36 to 4.39, L^⊥r, L^‖r, L^diagr45 and L^diagr135, summarized with the * notation as L^*r. Due to the * notation, the EDMs defined in Equations 4.40 to 4.42 stand for 12 EDMs, which are denoted as follows:

• E^angpg_⊥, E^angpg_‖, E^angpg_diag45, E^angpg_diag135,

• E^lenpg_⊥, E^lenpg_‖, E^lenpg_diag45, E^lenpg_diag135,

• E^distpg_⊥, E^distpg_‖, E^distpg_diag45, E^distpg_diag135.

These EDMs contain the connectivity knowledge between paired segments of local perceptual groups. The factor C_* ensures that all resulting EDMs are of the same size. Note that for the global features (see Section 4.2.1) the number of elements of each EDM may vary, due to the differing number of detected line segments. Since each connectivity EDM has a constant number of elements, we take every such EDM directly as a feature vector. For C_* = 3 we connect 3 groups of paired segments, resulting in EDMs of size 3 × 3 with 3 unique elements; these numbers form the feature vectors. At the end of this section we will list all histograms and feature vectors including their resolutions.
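A sketch of the connectivity features may look as follows (Python/NumPy; this is our own illustrative reading of Equations 4.40 to 4.42, and for simplicity the connection line is taken between the two segment mid-points of a pair rather than between their closest end points).

```python
import numpy as np

def connectivity_edms(pairs, C):
    """pairs: list of ((xb, yb, xe, ye), (xb, yb, xe, ye)) tuples, one per subgroup,
    holding the two closest segments of that subgroup.
    C: granularity factor, i.e. number of subgroups to connect."""
    mids, lengths, angles = [], [], []
    for s1, s2 in pairs[:C]:
        c1 = (np.asarray(s1[:2]) + np.asarray(s1[2:])) / 2.0   # segment mid-points
        c2 = (np.asarray(s2[:2]) + np.asarray(s2[2:])) / 2.0
        d = c2 - c1                                            # connection line d_kp (simplified)
        mids.append((c1 + c2) / 2.0)                           # mid-point p^c_kp of d_kp
        lengths.append(np.linalg.norm(d))                      # len(d_kp)
        angles.append(np.arctan2(d[1], d[0]))                  # theta_kp
    mids, lengths, angles = map(np.asarray, (mids, lengths, angles))

    e_ang = np.abs(angles[:, None] - angles[None, :])                      # Eq. 4.40
    e_len = np.abs(lengths[:, None] - lengths[None, :])                    # Eq. 4.41
    e_dist = np.linalg.norm(mids[:, None, :] - mids[None, :, :], axis=-1)  # Eq. 4.42
    return e_ang, e_len, e_dist          # three C x C connectivity EDMs
```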

To this point we have computed a large number of features from sets of line segments. In order to complete the description of sets of line segments we derive some statistical properties. In addition to the EDM-based features we compute further properties from each set of line segments that we have extracted so far. We have shown that spatial arrangements


of line segments can be encoded by Euclidean distance matrices. The hierarchical approach of forming subsets with respect to different properties seems reasonable. However, we can complete the set of features describing local structure by counting the members of each line segment (sub-)set. Therefore we define a fraction that relates the number of segments N^* of a subset to the total number of segments N detected in one image:

F^* = N^* / N, (4.43)

where F^* is in the range [0, 1]. F^* equal to 1 indicates that the two sets of segments are identical in size, whereas F^* = 0.5 means that half of all segments belong to the subset. Thus we obtain a single number for each subset of segments. The ensemble of all fractions forms a vector v^frac that is added to the features. The complete vector can be written as v^frac = {F^*s, F^*r, F^*rs}, with * denoting ⊥, ‖, diag45, diag135. Hence, v^frac consists of 3 × 4 numbers. Thus, in addition to the spatial relations of segments, we have obtained a quantitative measure.

Finally, we form a last histogram feature for our local structure description. The

distribution of the relative segment lengths can account for an image’s structural content.

Our experiments show that there is a strong distinction between images of man-made

objects and landscape or animal images, on the basis of the segment length distribution.

Man-made objects exhibit, in general, longer segments that vary less in their lengths.

Hence, we form histograms for each (sub-) set of segments obtained so far.

H^slen_* = {h^slen*_b}, b = 1, 2, ..., B_slen, (4.44)

where B_slen denotes the number of bins of the histogram. The b-th bin h_b of the histogram is defined similarly to Equation 4.19. The histogram represents the distribution of the relative lengths of a set of line segments.

Eventually we form the final set of local feature vectors by concatenating all histograms and feature vectors obtained so far. The final set of local features comprises histograms of local perceptual groups of line segments, which encode the relations between single segments as well as the connectivity of paired segments of perceptual groups. In addition, the set contains quantitative knowledge of the membership counts of every subset of segments and histograms that represent the distributions of the relative segment lengths.

Feature Representation In our experiments we have empirically determined the best resolution for the histograms. Note that the histogram resolutions are similar to the findings for the global features. For H^ang_* and H^angrs_* it turned out that 72 or 36 bins, which correspond to a 5° or 10° resolution with respect to angles, produced the best results. The


resolutions for H^len_*, H^dist_*, H^lenrs_* and H^distrs_* depend more on the application data than that of H^ang_*. However, we have found that 15 bins result in robust and compact histogram features. For H^slen_* we use a 20-bin resolution; v^frac was described in Equation 4.43.

The complete feature vector is a concatenation of all histograms and vectors:

H^local = {H^ang_*, H^len_*, H^dist_*, H^angrs_*, H^lenrs_*, H^distrs_*, E^distpg_*, E^lenpg_*, E^angpg_*, H^slen_*, v^frac}.

Note that the * denotes ⊥, ‖, diag45, diag135. For the experiments described in Chapter 5 we use a resolution of 36 bins for H^ang_* and H^angrs_*, and 15 bins for H^len_*, H^dist_*, H^lenrs_* and H^distrs_*, if not indicated otherwise. In the sequel of this thesis, H^local is referred to as local features or local structure-based features. The combination of H^local and H^global will be denoted as structure-based or simply structural features.

4.2.3 Summary

In this section we have developed a structure-based feature extraction method that encodes

relative spatial arrangements of line segments. The method determines relations on global

and local scales. Specifically, the global features capture a holistic representation for the

structural content of an image. Therefore, Euclidean distance matrices are computed.

The EDMs encode geometric relations for a set of line segments such as the angles of

all segments between each other, the relative lengths of every segment and the relative

Euclidean distance between all segment mid-points. The feature vector is obtained from

histograms that are computed from the EDMs.

The local features are computed from perceptual groups of line segments that are formed

according to angular relations between segments. Subsequently each group is divided into

subsets that are formed according to spatial relations between their members. Then,

local perceptual groups are extracted that consist of paired segments. In a subsequent

step we compute connectivity between these groups. Euclidean distance matrices help to

encode the geometric, angular and spatial relations for all sets of segments. In the end a

feature vector is formed that contains histograms of EDMs and histograms that represent

the distributions of the relative segment lengths. In addition the set contains a vector that

describes the fraction of all lines from an image that belong to the various subsets.


4.3 Similarity Measures

In this section we investigate the performance of several similarity measures for the global

and local structure features derived in the previous section. The concept of similarity is

one of the most difficult aspects within content-based image retrieval. The basic question

- ”What is similarity on the image level?” - remains unanswered to a certain degree. One

reason is the individual perception of man. Imagine the following example: Someone wants

to find all images similar to a query image, that show a red car in an urban scene. The

first person want to find just red cars. Another person might want to find the very same

car type, no matter which color. A third person might be not at all interested in the car,

but in the background building, and so on. Hence, an apparently simple looking image

can lead to a set of similar, correct image retrieval results - depending on a specific user’s

intention.

To date, this subjective perception cannot be modelled by a general mathematical

measure of image similarities. Nonetheless, there exists a large choice of image similarity

measures. Good reviews and overview articles can be found in [135], [122], [109], [141], [1],

[146], [74], [125], [98], and [34]. We do not intend to describe all measures in detail. For our further investigation, we will rather select a set of the most important ones.

In the sequel, we will describe eight relevant measures and test their performances on a real-world set of images. In general, each image in a database is represented by an n-dimensional vector v = (v_1, v_2, ..., v_n). Then, the similarity between a query image feature vector q = (q_1, q_2, ..., q_n) and the feature vectors v in the database is given by a metric measure (although some measures do not fulfill all properties of a metric). The similarity measures chosen for this investigation are:

• L1-norm: The L1-norm is sometimes also called Manhattan or city block distance. The histogram intersection is identical to the L1-norm in the case of normalized histograms.

• L2-norm: A similarity measure between two vectors based on the Euclidean distance.

• Matusita: The Matusita distance is a vector metric, i.e. it fulfills the (generalized) triangle inequality.

• Chi-square measure: The Chi-square similarity measure is based on the well-known chi-square test of equality.

• Correlation coefficient: The correlation coefficient can be interpreted as the dot product of the two mean-centered vectors, normalized by their norms.



• Bhattacharyya: The Bhattacharyya similarity measure [6] is often interpreted as a geometric similarity measure.

• Kullback-Leibler divergence: The Kullback-Leibler divergence [72] is a natural distance measure from a true probability distribution to an arbitrary probability distribution.

• Jensen-Shannon divergence: The Jensen-Shannon divergence [23] measures the similarity between two probability distributions.

The eight similarity measures are computed as follows:

L1-norm: L1(q, v) = Σ_{i=1}^{n} |q_i − v_i|, (4.45)

L2-norm: L2(q, v) = ( Σ_{i=1}^{n} |q_i − v_i|² )^{1/2}, (4.46)

Matusita: M(q, v) = Σ_{i=1}^{n} (√q_i − √v_i)², (4.47)

Chi-square: χ²(q, v) = Σ_{i=1}^{n} (q_i − v_i)² / (q_i + v_i), (4.48)

Bhattacharyya: B(q, v) = −ln Σ_{i=1}^{n} √(q_i v_i), (4.49)

Correlation coefficient:

r(q, v) = Σ_{i=1}^{n} (q_i − q̄)(v_i − v̄) / ( √(Σ_{i=1}^{n} (q_i − q̄)²) · √(Σ_{i=1}^{n} (v_i − v̄)²) ), (4.50)

Kullback-Leibler divergence:

K(p_1, p_2) = Σ_{i=1}^{n} p_1(x_i) log_2 ( p_1(x_i) / p_2(x_i) ) = Σ_{i=1}^{n} p_1(x_i) log_2 p_1(x_i) − Σ_{i=1}^{n} p_1(x_i) log_2 p_2(x_i). (4.51)

Jensen-Shannon divergence:

JS(p_1, p_2) = H(π_1 p_1 + π_2 p_2) − (π_1 H(p_1) + π_2 H(p_2)), (4.52)


with π_1 and π_2 as weights of the distributions, satisfying π_1 + π_2 = 1, and

H(p) = −Σ_{i=1}^{n} p_i log_2 p_i, (4.53)

denoting the Shannon entropy of the probability distribution p = (p_1, p_2, ..., p_n). In contrast to the Kullback-Leibler divergence, the Jensen-Shannon (JS) divergence is symmetric, always well defined and bounded.
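For illustration, these measures can be implemented in a few lines. The following sketch (Python/NumPy) assumes normalized, non-negative histograms; the small constant EPS guarding against division by zero and log of zero is our own addition and not part of the definitions above.

```python
import numpy as np

EPS = 1e-12

def l1(q, v):            return np.abs(q - v).sum()                        # Eq. 4.45
def l2(q, v):            return np.sqrt(((q - v) ** 2).sum())              # Eq. 4.46
def matusita(q, v):      return ((np.sqrt(q) - np.sqrt(v)) ** 2).sum()     # Eq. 4.47
def chi_square(q, v):    return ((q - v) ** 2 / (q + v + EPS)).sum()       # Eq. 4.48
def bhattacharyya(q, v): return -np.log(np.sqrt(q * v).sum() + EPS)        # Eq. 4.49
def intersection(q, v):  return np.minimum(q, v).sum()      # histogram intersection

def correlation(q, v):                                                     # Eq. 4.50
    qc, vc = q - q.mean(), v - v.mean()
    return (qc * vc).sum() / (np.linalg.norm(qc) * np.linalg.norm(vc) + EPS)

def kullback_leibler(p1, p2):                                              # Eq. 4.51
    return (p1 * np.log2((p1 + EPS) / (p2 + EPS))).sum()

def jensen_shannon(p1, p2, pi1=0.5, pi2=0.5):                              # Eq. 4.52
    H = lambda p: -(p * np.log2(p + EPS)).sum()                            # Eq. 4.53
    return H(pi1 * p1 + pi2 * p2) - (pi1 * H(p1) + pi2 * H(p2))
```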

Next we apply the eight measures to an image retrieval problem on ancient watermarks in order to determine the best one. As features we use the global and local sets {H^global, H^local} derived in Section 4.2.

Figure 4.5 shows the comparison of all previously defined measures. The eight measures have been applied to the two classes Cup and Bull Head of the ancient Watermark database (see Section 5.1.1). The graphs show class-wise averaged recall versus the number of retrieved images. We can see that, on average, the intersection measure performs best for both classes. The Bhattacharyya and the correlation coefficient measures are second best, followed by the other measures.

4.4 Data Normalization

Normalization is a process that changes the range of data. For image processing applications, including image representation, matching, CBIR and classification methods, data are usually normalized to the range [−1, +1] or [0, 1]. This compact feature representation improves subsequent data processing such as training a classifier or matching similar images in CBIR applications. Moreover, several authors have reported significant improvements of classification or image retrieval results when proper normalization methods are used [3], [2] and [41]. For the data represented in this thesis we use the following normalization methods:

• Linear scaling to unit range [0, 1]: set a lower bound l and an upper bound u for a feature vector,

v_n = (v − l) / (u − l). (4.54)

• Zero mean and unit variance normalization to the range [−1, 1]: compute the mean µ and standard deviation σ of the feature vector as in [59]:

v_n = (v − µ) / σ. (4.55)


Figure 4.5: Class-wise averaged recall versus number of retrieved images with eight different similarity measures (Intersection, L2-norm, Chi-square, Matusita, Bhattacharyya, correlation coefficient, Kullback-Leibler and Jensen-Shannon) for the two classes Cup and Bull Head of the ancient Watermark database (see Section 5.1.1).


Note the different range of the second normalization method. It should be mentioned that the data can be mapped with a probability of 99% to the unit range [0, 1] by a simple shift and rescaling operation [2]:

v_n = ( (v − µ) / (3σ) + 1 ) / 2. (4.56)

• Rank normalization uniformly maps a feature vector into the range [0, 1]:

v_n = ( rank_{v_i}(v) − 1 ) / (R − 1), (4.57)

where R is the maximum number of ranks for each feature vector v; R = max(r), with r = 1, 2, ..., n and n being the number of elements of v. Note that rank_{v_i}(v) performs a ranking, i.e. it sorts the feature vector elements in ascending order. Moreover, we could observe that the choice of the normalization method can have a significant impact on the results.
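The three normalization schemes can be sketched as follows (Python/NumPy; a minimal illustration assuming a one-dimensional feature vector, with default bounds chosen by us for convenience).

```python
import numpy as np

def scale_unit_range(v, l=None, u=None):
    """Linear scaling to [0, 1] (Eq. 4.54); bounds default to the min/max of v."""
    l = v.min() if l is None else l
    u = v.max() if u is None else u
    return (v - l) / (u - l)

def standardize(v):
    """Zero mean, unit variance (Eq. 4.55)."""
    return (v - v.mean()) / v.std()

def standardize_01(v):
    """Shift and rescale of Eq. 4.56: roughly 99% of the values fall in [0, 1]."""
    return ((v - v.mean()) / (3 * v.std()) + 1) / 2

def rank_normalize(v):
    """Rank normalization to [0, 1] (Eq. 4.57)."""
    ranks = np.argsort(np.argsort(v))          # 0 ... n-1, ascending order
    return ranks / float(len(v) - 1)
```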

4.5 Feature Space Representation

It is a common practice to assume that a feature space follows the normal distribution. This assumption is rarely verified, nor can it be said how often it holds. The authors in [110] observed that a uniform data distribution of feature sets results in better retrieval for the Euclidean distance measure. Further, they found that if the common assumption of a Gaussian feature space distribution does not hold, then Euclidean distance-based methods fail to work effectively. Similarly, [102] argued that assuming a Gaussian feature distribution is an effective means to normalize features for a similarity search. Hence, it is easy to recognize the importance of a Gaussian-like feature distribution in order to ensure the best possible results for Euclidean distance-based similarity measures.

Therefore, we have set up an experiment that checks for the real feature space distri-

bution with the help of goodness-of-fit tests. Various goodness-of-fit tests are reviewed in

[12], [94] and [24].

In order to test the real distribution of our features we use the Kolmogorov-Smirnov (KS) test [67], which makes no assumption about the data distribution; it is thus non-parametric and can be applied to any kind of distribution. This is a clear advantage over other tests such as the χ²-test, which assumes that the sampling error between the two empirical density functions has a normal (Gaussian) distribution. Each goodness-of-fit test involves the examination of a random sample from a dataset with a known underlying distribution in order to test the null hypothesis. The null hypothesis states that the unknown distribution function is of the same kind as a known, specified one.


In the following paragraph we will briefly summarize the most important aspects of the KS-test.

Kolmogorov-Smirnov Test Let F_1(x) be the theoretical cumulative distribution to be tested, which must be fully specified, and let F_2(x) be the empirical cumulative distribution describing the distribution of the data to be tested. (Cumulative distribution functions (CDFs) are monotonically increasing and continuous from the right; additionally, every CDF satisfies lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.) For the actual test we have to state the null hypothesis H_0 and the alternative hypothesis H_1:

H_0 : f_1(x) = f_2(x), −∞ < x < ∞, (4.58)

H_1 : f_1(x) ≠ f_2(x), −∞ < x < ∞, (4.59)

with f_2(x) being the sample probability density function and f_1(x) the density of the specified theoretical distribution. Then the probability distribution functions F_1(x) and F_2(x) are given by:

F_1(x) = Σ_{x=1}^{N_1} f_1(x), (4.60)

F_2(x) = Σ_{x=1}^{N_2} f_2(x), (4.61)

where N_1 and N_2 are the number of elements of F_1(x) and F_2(x), respectively. Moreover, the intervals of both probability distribution functions are normalized by the number of samples to the range [0, 1]. Then the test statistic D_n is given by:

D_n = sup_x |F_2(x) − F_1(x)|, (4.62)

where D_n denotes the maximum difference between the observed empirical distribution F_2(x) and the theoretical cumulative distribution F_1(x). Thus, we obtain a measure of the difference between the theoretical model and the experimental data. Finally, the KS-test determines for a specified level of significance α whether the null hypothesis is rejected or fails to be rejected: for D_n < D_n^α the null hypothesis is not rejected, and for D_n ≥ D_n^α it is rejected. The critical values D_n^α can be found in statistical tables.

Experiment The null hypothesis H_0 for our experiment is that our feature space X has a standard normal distribution, that is, a normal distribution with zero mean and a variance equal to one. The alternative hypothesis H_1 is that X does not follow that distribution. We chose a level of significance of α = 0.05. Then, the KS-test results in



a test decision of one if we can reject the null hypothesis that X follows a standard normal distribution, or of zero if we cannot reject the null hypothesis.

We decided to test the feature space distribution of the histograms derived from the angle-based Euclidean distance matrix defined in Equation 4.12. The 36-bin histogram H^ang was defined in Section 4.2.1. The feature space X consists of 2662 feature vectors obtained from the Caltech image database (see Section 5.1.3). The distribution of the feature space X corresponds to F_2(x) in Equation 4.62, and F_1(x) is a standard normal distribution. The KS-test then yields a test decision of zero. Hence, we cannot reject the null hypothesis H_0 for our feature space X, i.e. the tested feature space X is consistent with a standard normal distribution.
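A minimal sketch of such an experiment with SciPy could look as follows (Python; the random placeholder data and the standardization step are our own assumptions, standing in for the real feature vectors, which would have to be standardized before testing against a standard normal distribution).

```python
import numpy as np
from scipy import stats

# X: feature matrix of shape (n_images, n_bins), e.g. the 36-bin H_ang histograms
X = np.random.rand(2662, 36)                    # placeholder for the real features

sample = (X - X.mean()) / X.std()               # standardize to zero mean, unit variance
D_n, p_value = stats.kstest(sample.ravel(), 'norm')

alpha = 0.05
reject_H0 = p_value < alpha                     # True means: not standard normal
print(D_n, p_value, reject_H0)
```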

4.6 Summary

In this chapter we have derived a structure-based feature extraction method. The method

determines relations on a global and local scale. The method is capable of representing

the global structure of an image, as well as the local structure of perceptual groups and

their connectivity. The advantage of the method is the invariance towards changes in

illumination, similarity transformations (rotation, translation and isotropic scaling) and

slight changes of the viewing angle, as shown in Chapter 5.

Next, we have compared eight similarity measures using the global and local features. The measures have been applied to images of the ancient Watermark database (see Section 5.1.1). The retrieval results reveal that, on average, the intersection measure performs best for our features. We have also discussed the normalization of our features in Section 4.4. The chapter was concluded with a discussion of the feature space representation, for which we have used the Kolmogorov-Smirnov test in order to test the real distribution of our features.

In the next chapter we present classification and image retrieval results for several

applications.


Chapter 5

Applications

In this chapter we present several image classification and content-based image retrieval

applications. First we will describe the datasets. Then we will discuss the performance measures we use for the evaluation of our results. Since support vector machines (SVMs) are used for the classification tasks, we will briefly review their fundamental properties. In the sequel, we will apply our previously introduced features to the following

problems:

• Ancient watermark retrieval and classification (Filigrees)

• Color image retrieval (Corel database)

• Object class recognition and retrieval (Caltech database)

• Texture class recognition (Brodatz collection)

For each application we present a comprehensive evaluation with the performance measures defined in Section 5.2, followed by a detailed discussion of the results. Various sample retrieval results give the reader a visual impression of the performance. The presentation is completed by an invariance analysis of our features under various image transformations.

5.1 Data Description

5.1.1 Ancient Watermark Images

Watermarks in paper have been in use since 1282 in medieval Europe. A watermark can be understood as an ancient form of a copyright signature. In fact, watermarks served as a mark of the paper mill that made the sheet. Thus, it served


as a unique identifier and as a quality label (modern watermarks serve as security identifiers, such as on banknotes or important documents). Nowadays, scientists from the International Association of Paper Historians (IPH) try to identify unique watermarks in order to learn about the evolution of commercial and cultural exchanges between cities in the Middle Ages [55].

The physical creation of watermarks (whose actual shape can be seen when the paper is held up to the light) is described in detail by [28] as follows:

A watermark is formed by the attachment of a wire design to the mesh surface

of the papermakers’ mould. During paper production the paper pulp is scooped

from a vat onto the surface of the mould and the excess water is allowed to

drain away through the mesh. The wire watermark, which sits proud of the

mesh, reduces the density of fibers deposited on that area of the mould, and

when the finished sheet is viewed with transmitted light, the area where the wire

had been present is thinner and appears lighter than the remainder of the sheet.

The interest of the IPH association lies specifically in the determination of the origin and

date of creation of an ancient unknown paper by comparison with known documents bearing

a similar watermark signature [114]. It is then possible to determine whether this unknown

paper comes from the same region and approximately the same period as the reference

watermark. The internationally known Swiss Paper Museum in Basel houses thousands of

images of historical papers as well as ancient watermarks. There are approximately 600,000 known watermarks and their number is steadily growing. Note that the variety of their shapes and forms is abundantly large; the early watermarks often exhibit sanctuary symbols, geometric figures, animals, crescents, etc. The Swiss Paper Museum in Basel has

provided us with a subset of their digital Watermark database. The database used in the

subsequent experiments consists of about 1800 images (see Table 5.1). Figure 5.1 shows

scanned sample watermark images. A detailed description of the scanning setup can be

found in [112]. In fact, the watermarks are digitized from the original sources. Specifically,

each ancient document has to be scanned three times (front, back and by transparency)

in order to obtain a high quality digital copy, where the last scan contains all necessary

information [112]. A semi-automatic method, described in [112], delivers the final images. The method incorporates global contrast and contour enhancement as well as grey-level inversion. Figure 5.3 shows sample images after the method was applied. Subsequently,

these images are used in order to extract descriptors and features (described in Chapter 4),

for retrieval or classification problems.



Figure 5.1: Samples of scanned ancient watermark images (courtesy Swiss Paper Museum, Basel).


Figure 5.2: Sample filigrees of each class from the Watermark database after enhancement and binarization (see [112]). The classes are, according to Table 5.1 (starting from top left): Eagle, Anchor1, Anchor2, Coat of Arms, Circle, Bell, Heart, Column, Hand, Sun, Bull Head, Flower, Cup and Other objects.


Figure 5.3: Sample filigrees from the Watermark database after enhancement and binarization (see [112]). Each of the four rows shows watermarks from the same class, namely Heart, Hand, Eagle and Column. The samples show the large intra-class variability within the Watermark database.


Table 5.1: The classes of the Watermark database.

Class Nr.   Class Name             # of Class Members
1           Eagle                  322
2           Anchor1                115
3           Anchor2                139
4           Coat of Arms           71
5           Circle                 91
6           Bell                   44
7           Heart                  197
8           Column                 126
9           Hand                   99
10          Sun (various types)    33
11          Bull Head              14
12          Flower                 31
13          Cup                    17
14          Other objects          416

Total                              1715

5.1.2 Corel Databases

We used two databases from the Corel image collection. The first database is a widely

used subset of the Corel database. We utilize this image collection, since there are several

results published for it [82], [143] and [25]. The dataset consists of 1,000 images containing 10 classes with 100 color images each and can be downloaded at [22]. The 10 image classes are Africa, Beach, Buildings, Buses, Dinosaurs, Flowers, Elephants, Horses, Food and Mountains. Although the database size is quite small by today's standards, it has become the de facto benchmark for CBIR applications. Note that the class labels were assigned by Corel and were not altered by us. The second dataset contains 10,000 images and was

used for a further evaluation of our results. The collection was created by [87], with images

from different publicly available image databases, i.e. the dataset has a large variation of

different image formats and image acquisition settings. The set is made up of 25 different

classes with 400 color and gray-scale images each. For a better understanding of the

database used in the experiments, Fig. 5.4 shows some sample images taken out of the two

image databases.

5.1.3 Caltech Database

We decided to use the Caltech database [32] since several authors have reported results for

this image collection or for subsets of it. In addition, the Caltech image set exhibits various


Figure 5.4: Sample images from the two Corel databases, where the images in the first row are randomly taken from the 1,000-image set and the images in the second row from the 10,000-image database.


Figure 5.5: Two sample images of each class from the Caltech database.

different classes and, thus, is suitable for classification and retrieval tasks. In detail, we

use five classes for our experiments that are airplane (class 1), cars (class 2), faces (class

3), leaves (class 4) and motorbikes (class 5). In total the dataset consists of 2662 images

where each class contains a different number of members. Figure 5.5 shows two sample

images of each class. It can be seen that the background may vary heavily from image to image. In addition, many images differ significantly in their sizes. In order to assure comparable

results we did not alter, remove or modify any images.

5.1.4 Brodatz Database

The Brodatz database [11] consists of 13 textures, shown in Figure 5.6. Each 512×512 image is digitized under six different rotations (0°, 30°, 60°, 90°, 120° and 150°). Moreover,

every image is subdivided into 16 non-overlapping images of the size 128×128. Thus, the

whole database consists of 1248 texture samples belonging to 13 classes with 96 members

each.

The 13 classes contain the following textures: Bark (class 1), Brick (class 2), Bubbles (class 3), Grass (class 4), Leather (class 5), Pigskin (class 6), Raffia (class 7), Sand (class 8), Straw (class 9), Water (class 10), Weave (class 11), Wood (class 12) and Wool (class 13).

5.2 Performance Evaluation

For the results presented in this thesis we use several performance measures. In the follow-

ing two subsections we will describe measures for classification and content-based image

retrieval tasks.


Figure 5.6: Sample images from the 13 classes of the Brodatz image database: Bark, Brick, Bubbles, Grass, Leather, Pigskin, Raffia, Sand, Straw, Water, Weave, Wood and Wool.


Table 5.2: Sample confusion matrix for a two-class problem.

                             Predicted Class
                             Negative    Positive
Actual Class    Negative     CM1         CM2
                Positive     CM3         CM4

5.2.1 Measures for Classification

It is common to represent classification results by a so-called confusion matrix. Such a matrix is typically of size N × N, where N specifies the number of classes, and displays the actual and predicted class labels obtained by a classifier. In order to describe the properties of a confusion matrix we assume N = 2, which means we consider a two-class problem. Table 5.2 shows a schematic representation of a two-class confusion matrix, where each column and row shows the predicted and actual class instances, respectively.

More precisely, CM1 and CM4 display the number of correct predictions for negative and positive cases, respectively. Similarly, CM2 and CM3 display the number of incorrect predictions for negative and positive cases, respectively. In order to evaluate the content of a confusion matrix the following measures can be derived:

• True Positive rate (TP), sometimes referred to as Recall: fraction of positive cases that are correctly classified.

• False Positive rate (FP): fraction of negative cases that are incorrectly classified as positive.

• Accuracy (AC): fraction of all classifications that are correct.

• True Negative rate (TN): fraction of negative cases that are correctly classified.

• False Negative rate (FN): fraction of positive cases that are incorrectly classified.

• Precision (P): fraction of cases classified as positive that are actually positive.

• Geometric Mean1 (G-Mean1) and Geometric Mean2 (G-Mean2) [71]: geometric mean of TP with P and of TP with TN, respectively. Note that the G-Mean is high when TP and TN are large and their difference is small.

• F-Measure (F) [79]: weighted combination of the true positive rate and precision (see Equation 5.9).


With respect to the two-class confusion matrix shown in Table 5.2 we can define the above measures as follows:

TP = CM4 / (CM3 + CM4), (5.1)

FP = CM2 / (CM1 + CM2), (5.2)

AC = (CM1 + CM4) / (CM1 + CM2 + CM3 + CM4), (5.3)

TN = CM1 / (CM1 + CM2), (5.4)

FN = CM3 / (CM3 + CM4), (5.5)

P = CM4 / (CM2 + CM4), (5.6)

G-Mean1 = √(TP · P), (5.7)

G-Mean2 = √(TP · TN), (5.8)

F = (β² + 1) · P · TP / (β² · P + TP), (5.9)

with β ∈ [0, 1], where β = 1 means that TP and P are weighted equally strongly. Thus, the F-measure can be seen as the combination of precision and recall in a single expression, where β indicates the relative importance of recall and precision. All measures can also be computed for multi-class problems. In general, for classes with different numbers of members, simple averaging might distort the performance measures. Therefore, it is advisable to take the class size into account. This can be accomplished by weighted averaging with the class sizes as weights:

M_w = ( Σ_{i=1}^{N} w_i · M_i ) / ( Σ_{i=1}^{N} w_i ), (5.10)

where M_i may be any measure from Equations 5.1 - 5.9.
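For a two-class confusion matrix, the measures of Equations 5.1 to 5.9 reduce to a few ratios. A Python sketch is given below; the matrix layout follows Table 5.2, β defaults to 1, and the example counts are purely illustrative.

```python
import numpy as np

def classification_measures(cm, beta=1.0):
    """cm: 2x2 confusion matrix [[CM1, CM2], [CM3, CM4]] as in Table 5.2."""
    cm1, cm2, cm3, cm4 = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    tp = cm4 / (cm3 + cm4)                                  # Eq. 5.1 (recall)
    fp = cm2 / (cm1 + cm2)                                  # Eq. 5.2
    ac = (cm1 + cm4) / cm.sum()                             # Eq. 5.3
    tn = cm1 / (cm1 + cm2)                                  # Eq. 5.4
    fn = cm3 / (cm3 + cm4)                                  # Eq. 5.5
    p = cm4 / (cm2 + cm4)                                   # Eq. 5.6 (precision)
    g_mean1 = np.sqrt(tp * p)                               # Eq. 5.7
    g_mean2 = np.sqrt(tp * tn)                              # Eq. 5.8
    f = (beta ** 2 + 1) * p * tp / (beta ** 2 * p + tp)     # Eq. 5.9
    return dict(TP=tp, FP=fp, AC=ac, TN=tn, FN=fn, P=p,
                G_Mean1=g_mean1, G_Mean2=g_mean2, F=f)

cm = np.array([[50, 10], [5, 35]], dtype=float)             # illustrative counts
print(classification_measures(cm))
```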

5.2.2 Measures for Content-based Image Retrieval

In order to evaluate CBIR results we use precision and recall graphs. Precision, the likelihood that a retrieved image is relevant, and recall, the likelihood that a relevant image has


been retrieved, are defined as follows:

Recall = (retrieved and relevant images) / (all relevant images in the database), (5.11)

Precision = (retrieved and relevant images) / (number of retrieved images). (5.12)

In most situations, the number of class members will vary. Then it is difficult to compare precision and recall numbers between different classes or databases. For these cases the average rank (the simple average over the ranks of all relevant images) is not meaningful enough, since the database size N and the number of relevant images N_r may differ. Thus, a quantity that normalizes by N and N_r is necessary. That is the normalized average rank, introduced in [95] and defined as follows:

Rank = (1 / (N · N_r)) · ( ( Σ_{i=1}^{N_r} R_i ) − N_r(N_r − 1)/2 ), (5.13)

where R_i is the rank of the i-th relevant image (the measure is then averaged class-wise). In contrast to many CBIR measures, Rank = 0 indicates a perfect performance; the closer Rank approaches 1, the worse the performance gets.
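The normalized average rank can be computed directly from the ranks of the relevant images. A minimal Python sketch (ranks are assumed to be 1-based positions in the result list; the example values are purely illustrative):

```python
def normalized_average_rank(relevant_ranks, database_size):
    """Eq. 5.13: values close to 0 indicate good retrieval, values near 1 poor retrieval.
    relevant_ranks: 1-based ranks of all relevant images in the result list."""
    n_r = len(relevant_ranks)
    return (sum(relevant_ranks) - n_r * (n_r - 1) / 2.0) / (database_size * n_r)

# All relevant images ranked first (ranks 1..4) gives a very small value:
print(normalized_average_rank([1, 2, 3, 4], database_size=1000))
```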

5.3 Support Vector Machines

During the last years, support vector machines (SVMs) have become a very popular technique for classification problems. The basic concept was introduced in [139] and triggered a large amount of fundamental findings and practical applications, such as in the field of image classification [48], [21]. For a detailed discussion we refer the interested reader to [140], [16] or [123]. To date, the distribution of various easy-to-use software packages [20], [118] has helped to spread the SVM technique to fields as different as computer science, biological and medical research, engineering and economics.

In contrast to clustering algorithms, which are unsupervised learning methods (see Chapter 3), SVMs belong to the supervised learning techniques. That implies the usage of training and testing data, where the former serves as input for the learning function and the latter is used to test the generalization ability of the classifier on unseen data samples. In fact, the learning algorithm generates a model that should be able to accurately map input objects to desired outputs (for most classification tasks the output is a class label). Next, we will briefly summarize the basics of SVMs.



In detail, a support vector machine is based on classifying hyperplanes

(w · x) + b = 0, (5.14)

with w ∈ R^n and b ∈ R, where the optimal hyperplane can be found by margin maximization.

Assume a set of labeled feature vectors (x_i, y_i), i = 1, 2, ..., k, with x_i ∈ R^n and y_i ∈ {−1, +1}. The SVM calculates the optimal hyperplane that separates the feature vectors x_i with label y_i = +1 from the feature vectors x_j with label y_j = −1. The resulting hyperplane is optimal in the sense of maximal margin with respect to both classes. In fact, the hyperplane can be described by just a few feature vectors, the so-called support vectors. Moreover, the hyperplane is parameterized by a vector w, normal to the hyperplane, and an offset value b. The actual classification is performed by:

y = sgn(w^T x + b). (5.15)

Note that the decision function only depends on dot products between patterns. Next, we will briefly show the calculation of the hyperplane for a simple linearly separable two-class problem.

Calculating the hyperplane: In order to compute the hyperplane we normalize the parameters w and b such that

min |wTx + b| = 1.  (5.16)

Thus, the distance from any feature vector x to the hyperplane can be computed by

z = |wTx + b| / ‖w‖.  (5.17)

Specifically, the margin measured perpendicularly to the hyperplane is given by 1/‖w‖ + 1/‖w‖ = 2/‖w‖. Finally, the following optimization problem has to be solved:

Minimize (1/2)‖w‖²,  (5.18)

subject to yi · (wTxi + b) ≥ 1, i = 1, 2, . . . , k.  (5.19)

This quadratic optimization problem can be solved by the Lagrangian formalism maxλ≥0 (minw,b (L(w, b, λ))), with

Minimize L(w, b, λ) = (1/2)‖w‖² − Σ_{i=1}^{n} λi · [yi(wTxi + b) − 1],  (5.20)

with λi ≥ 0. In words, L(w, b, λ) has to be minimized with respect to w and b, and at the same time the derivatives of L(w, b, λ) with respect to λi have to vanish, under the constraints λi ≥ 0. Note that Equation 5.20 states a convex quadratic programming problem with a convex objective function. Thus, it is permitted to equivalently solve the so-called dual7 problem, that is, to maximize L(w, b, λ)8. The solution is characterized by a subset of all training patterns. In detail, all patterns with non-zero λi, i.e. the support vectors, form the optimal solution. Thus, only the closest patterns contribute to the determination of the optimal hyperplane. Note that all other training patterns do not contribute to the solution.

The non-linear separable case

The next step is to generalize the SVM method to non-linear decision functions. It appears that the training patterns are only represented as dot products in the dual picture of Equation (5.20). The idea is to first map all data patterns to a Hilbert space H, such that there exists a mapping

x ∈ Rn ↦ z ∈ Rm,

from the input feature space to an m-dimensional space. In the Hilbert space the input patterns appear in the training algorithm only in the form of dot products. Assume the following function exists, K(x, z) = Φ(x) · Φ(z), and call it a kernel function. Then we would only need to use K in the learning algorithm. In fact, Mercer's theorem (see Appendix 1) states that positive definite kernels can be represented by a set of basis functions and makes it possible to learn in the feature space without explicit knowledge of Φ. Any Mercer kernel9 describes a scalar product in a high dimensional space. Thus, we simply exchange the scalar product by kernel functions K(x, z). Commonly used kernels K(x, z) are:

7 Often referred to as the Wolfe dual problem.
8 The final dual optimization problem can be written as: maxλ Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λiλjyiyjxiTxj, subject to Σ_{i=1}^{n} λiyi = 0; λ ≥ 0.

9Each kernel function with the properties of Theorem 1.


• Linear

  K(x, z) = xTz.  (5.21)

• Polynomials

  K(x, z) = (xTz + 1)^q, q > 0.  (5.22)

• Radial Basis Functions

  K(x, z) = exp(−‖x − z‖² / σ²).  (5.23)

• Histogram Intersection

  K(x, z) = Σ_{i=1}^{n} min{xi, zi}.  (5.24)

• Hyperbolic Tangent

  K(x, z) = tanh(βxTz + γ), for β, γ such that Mercer's conditions 1 are satisfied.  (5.25)

Thus, under consideration of Mercer's Theorem, the whole formalism of the linear separable case holds for the non-linear separable problem.
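As an illustration only (Python with NumPy; the function names are ours and not taken from the thesis), the kernels of Equations 5.21 to 5.24 can be written directly on feature vectors:

    import numpy as np

    def linear_kernel(x, z):
        return float(np.dot(x, z))                                   # Equation 5.21

    def polynomial_kernel(x, z, q=2):
        return float((np.dot(x, z) + 1.0) ** q)                      # Equation 5.22

    def rbf_kernel(x, z, sigma=1.0):
        return float(np.exp(-np.sum((x - z) ** 2) / sigma ** 2))     # Equation 5.23

    def intersection_kernel(x, z):
        return float(np.minimum(x, z).sum())                         # Equation 5.24

The histogram intersection kernel (Equation 5.24) is the one used for the classification experiments later in this chapter.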

Non separable case

In general, a data set may consist of non-separable patterns. Then applying the above algorithm will not succeed in finding a correct solution. Outliers may be caused by noise and should not have a big influence on the resulting hyperplane. A possibility to overcome these influences is to "soften" the margin, i.e. to allow misclassification of training samples. For this purpose, so-called positive slack variables ξi, i = 1, 2, . . . , n, can be introduced in the constraints. Then the constraints can be relaxed as:

wTxi − b ≤ −1 + ξi, ∀ yi = −1,  (5.26)
wTxi − b ≥ +1 − ξi, ∀ yi = +1,  (5.27)
ξi ≥ 0, ∀ i.  (5.28)

Thus, we allow misclassification which should in turn be penalized. Therefore, a penalty parameter C is introduced in the convex optimization problem.


The final optimization problem can be written in the Wolfe dual representation according to [16] as:

Maximize: LD(w, b, λ) = Σ_{i=1}^{n} λi − (1/2) Σ_{i,j} λiλjyiyjxiTxj;  (5.29)

subject to 0 ≤ λi ≤ C,  (5.30)

Σ_{i=1}^{n} λiyi = 0; λ ≥ 0.  (5.31)

C is a positive parameter that controls the penalization of outliers. The lower C, the more outliers are allowed. The final solution appears to be:

w = Σ_{i=1}^{Ns} λiyixi,  (5.32)

where Ns denotes the number of support vectors. Note, that λi has an upper bound of C10.

In the sequel of this chapter we will use SVMs for the classification of image classes.
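For concreteness, the following sketch shows one way to train such a classifier with a precomputed histogram intersection kernel. It uses Python with NumPy and scikit-learn; neither library is mentioned in the thesis (which refers to the packages [20], [118]), the toy data and the penalty value are purely illustrative, and the helper name intersection_gram is ours:

    import numpy as np
    from sklearn.svm import SVC

    def intersection_gram(A, B):
        # Gram matrix of the histogram intersection kernel (Equation 5.24)
        # for two sets of feature histograms A (n_a, d) and B (n_b, d)
        return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

    # toy data standing in for normalized feature histograms and class labels
    X_train = np.random.rand(40, 16)
    y_train = np.repeat([0, 1], 20)
    X_test = np.random.rand(5, 16)

    svm = SVC(kernel="precomputed", C=1000.0)   # illustrative penalty parameter
    svm.fit(intersection_gram(X_train, X_train), y_train)
    predicted = svm.predict(intersection_gram(X_test, X_train))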

5.4 Filigrees

In this section we discuss the application of the structure-based features (see Chapter 4),

to the ancient Watermark database (see Section 5.1.1). First, we will review related work.

Then we present the retrieval results followed by the classification of the 14 watermark

classes.

5.4.1 Related Work

To date, there have been attempts to classify and retrieve watermark images, both by textual- and content-based approaches. Del Marmol [88] classified ancient watermarks according to the date of creation. Briquet [10] manually classified about 16.000 watermarks into about 80 textual classes. Pure textual classification systems might be error prone: watermark labels and/or textual descriptions might be very old, erroneous or just not detailed enough. Moreover, many watermarks are not labeled and, thus, cannot be used for textual queries. Therefore, more recent attempts have been undertaken in order to focus on the real content of watermark images. In [113] the authors describe the set-up of a system capable of adding, editing or removing watermarks, with the possibility of textual and content-based retrieval.

As described in the paper, the authors used a 16-bin large circular histogram computed

10The upper bound is the only difference from the optimal hyperplane case.


around the center of gravity of each watermark image. In addition, eight directional filters

were applied to each image and used as a feature vector. The algorithms were tested on

a watermark database consisting of 120 images, split up into 12 different classes. The

system achieved a probability of 86% that the first retrieved image belongs to the same

class as the query image. A different approach was taken by the authors in [115] and [78]

who used three sets of various global moment features and three sets of component-based

features. The latter feature set consists of several shape descriptors which are extracted

from various image regions. For testing purposes, the authors selected a database consisting

of 806 tracings of watermarks from the Churchill collection. The authors set-up a retrieval

system with the city-block distance as discrimination measure. The system was evaluated

with normalized and averaged precision (Pn) and recall (Rn) scores. Therefore, 15 images

and their ground-truths have been manually selected. The authors report a precision of 0.53

and a recall of 0.81 for their best features, obtained from the original grey-level images. By

introducing a threshold11 the results could be improved. Finally, the authors conclude that

for their database, global features work better then the component-based (local) features

extracted from various regions.

5.4.2 Retrieval Results

In this subsection we show the results obtained from the medium-sized set of filigree images. The usage of line segments and the computation of global and local features of their arrangements resulted in a highly performant method, as the results show.

Features

We use the global and local features as described in Chapter 4. The feature set {Hglobal, Hlocal} is normalized according to Equation 4.54. The features are computed offline for the com-

plete database. At retrieval time, only the feature vector for the query image has to be

computed. The retrieval results are obtained with the intersection similarity measure (see

Section 4.3).
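A minimal sketch of this query step (Python with NumPy; the names are ours, not from the thesis), assuming the normalized feature histograms of the database are already stored as rows of a matrix:

    import numpy as np

    def rank_by_intersection(query_hist, database_hists):
        # histogram intersection similarity of the query against every database
        # image, followed by sorting in decreasing similarity
        sims = np.minimum(query_hist[None, :], database_hists).sum(axis=1)
        order = np.argsort(-sims)
        return order, sims[order]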

First, we will show several retrieval results in order to give a better impression of the

watermark images. The results show that most classes exhibit a high intra-class variability.

However, the retrieval results will be followed by a thorough performance analysis (see

Section 5.2).

Figure 5.7 shows a set of 20 watermark images. The first image is the query, the second

one is the identical match, indicated by the 1 above the image. The subsequent images are

sorted in decreasing similarity, as it is indicated by the numbers above each image. As the

numbers indicate, most images are very similar. It is interesting to observe that most of

11The authors do not explain how the threshold value was obtained.


Figure 5.7: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Anchor1 of the Watermark image database.


the retrieved anchors show the same orientation except the last image in the second last

row. A closer look at the query image reveals that it is featured with a tiny cross atop and

with cusp-like structures at the outer endings12. The retrieved images clearly show that

both of these small scale structures are present in all (except one image) of the displayed

images.

In Figure 5.8 we can see another retrieval result. The similarity values show a faster

decrease in comparison to Figure 5.7. A closer look at the retrieved images reveals that

the actual watermarks do contain slightly different substructures, such as the segment of a

circle in the query image that is hardly found in the other filigrees. The last image in the second-to-last row is a mismatch from the class Anchor1. We argue that the anchor's

basic structure is quite similar to the query image in terms of the geometric arrangements

of the line segments.

Figure 5.9 displays a sample query from the class Column. We can see that all images

belong to the query image's class, except the second-to-last image, which is from the class

Heart. Visually, it is not clear to us why the heart image is retrieved before another

column image. However, it is not always possible to translate visual appearance into the

high dimensional feature space, where the actual image similarity is computed.

In Figure 5.10 we can see a retrieval result from the class Flower. Again, the first

retrieved image is identical with the query image. Although the class Flower consists only

of 31 images13, all displayed images belong to the same class. The good retrieval result

might come from the smaller intra-class variation in comparison to the other classes. Note,

that for the class Flower the classifier14 obtained a true positive rate of 100%.

Figures 5.11, 5.12 and 5.13 are of special interest. Each figure displays sample retrieval

results, where the query images of Figure 5.11 and 5.12 belong to the class Eagle and

the query watermark from Figure 5.13 is out of the class Coat of Arms. It came to our

attention that images from the class Coat of Arms did not perform very well in terms of

precision-recall graphs displayed in Figure 5.14. A visual inspection of all members of the

classes Coat of Arms and Eagle explained the drop in performance. The first and more

important reason is that eagle motifs are very common in heraldry, i.e. about half of the members of the class Coat of Arms have some kind of eagle embedded on a shield or armorial bearings. Thus, eagles and embedded eagles are retrieved at the same time, as shown in Figure 5.12, where the query filigree belongs to the class Eagle. Among the

retrieved watermarks we can find several images from the class Coat of Arms. The sample

retrieval in Figure 5.13 shows the same phenomenon, just reversed. Here, the query image

12 Note that the class Anchor1 possesses a large intra-class variation of shapes, i.e. many anchors have no crosses or show very different endings.
13 The class Flower is the third smallest class of the database. The sizes of the other classes are displayed in Tab. 5.1.
14 See Table 5.5.


Figure 5.8: Sample retrieval result of the class Circle from the Watermark database, under the usage of global and local structural features (see Section 4.2).


Figure 5.9: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Column of the Watermark database.


Figure 5.10: Sample retrieval result of the class Flower from the Watermark database, under the usage of our structural features (see Section 4.2).


Figure 5.11: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Eagle of the Watermark database.


Figure 5.12: Sample retrieval result obtained with our structure-based features (see Section 4.2) from the class Eagle of the Watermark database.


Figure 5.13: Sample retrieval result of the class Coat of Arms from the Watermark database. As features we have incorporated our global and local structure-based features (see Section 4.2).


belongs to the class Coat of Arms. The second reason we could identify is the difference

in size of the two classes. In fact, they approximately vary by a factor of five in their

sizes. Thus, it is more probable to observe eagles retrieved by queries from the class Coat

of Arms (see Figures 5.12 and 5.13) than it is the other way round (see Figure 5.1115). A quantitative proof for our observations can be found in the confusion matrix in Tab. 5.4,

where for example 31 images from class four (Coat of Arms) are classified as class one

(Eagle). In fact, we could observe similar occurrences for some of the other classes.

15 The figure shows a sample retrieval of eagles. Note that for this query watermark only filigrees from the same class have been retrieved.


[Figure 5.14 consists of 14 panels, one per watermark class (Eagle, Anchor1, Anchor2, Coat of Arms, Circle, Bell, Heart, Column, Hand, Sun, Bull Head, Flower, Cup, Other objects), each plotting recall against the number of retrieved images (up to 1500).]

Figure 5.14: Class-wise recall vs. number of images retrieved graphs for the Watermarks.


Table 5.3: Class-wise performance measures for the Watermark database. A detailed description can be found in Section 5.2.2 and in the text.

Measures\Class    1      2      3      4      5      6      7      8      9      10     11     12     13     14
Nr                322    115    139    71     91     44     197    126    99     33     14     31     17     416
Rank              .1360  .3215  .1976  .1772  .4508  .2333  .4444  .3639  .1326  .2229  .1261  .0835  .092   .3621
Rank1             1      1      1      1      1      1      1      1      1      1      1      1      1      1
P(10)             .7435  .5515  .5414  .2241  .3184  .4194  .4597  .5000  .6886  .1058  .0853  .9167  .4759  .6195
P(30)             .6593  .4298  .3565  .1532  .1643  .0968  .3165  .2256  .5744  .0342  .0194  .1391  .0517  .5042
P(N/2)            .4916  .2431  .2138  .1435  .1089  .2441  .1730  .0973  .4417  .0684  .1895  .8024  .5560  .2830
P(N)              .2228  .0744  .1187  .0643  .0540  .0292  .1157  .0755  .0822  .0319  .0194  .0241  .0517  .2427
P(Best, 10)       1      1      1      .4762  1      1      1      1      1      .2778  .2857  1      .9091  1
P(Best, 30)       1      1      .9091  .2632  .4839  .2206  .9091  .6977  1      .0433  .0253  .4545  .1269  1
P(Best5, 10)      1      1      1      .4015  .9848  .9268  1      1      1      .2250  .1479  1      .8460  1
P(Best5, 30)      .9731  1      .8395  .2378  .4309  .1893  .8913  .6478  1      .0406  .0214  .3645  .0842  1
R(N)              .4783  .1217  .2590  .1690  .0659  .0909  .1320  .1587  .2828  .0606  .0714  .4839  .235   .3197
R(N/2)            .5280  .1391  .3022  .1972  .0879  .1818  .1523  .1905  .2626  .0606  .1429  .7097  .352   .3606
R(Best, N)        .6118  .5565  .4245  .3099  .3736  .5227  .4467  .3651  .6667  .2727  .4286  .7742  .647   .5312
R(Best, N/2)      .7764  .8596  .6377  .3714  .5778  .8182  .6531  .5556  .9388  .4375  .7143  1      1      .7019


Table 5.3 shows a variety of performance measures (see Section 5.2) for every class.

Note, that all measures are class-wise averaged. We can see that the first match is always

identical with the query, indicated by Rank1 equal to 1. In addition we list precision

values at various positions, i.e. the value of P (10) for class 1 means that on average

almost 75% of the first 10 images retrieved belong to the first class. Moreover, we present

the precision after the 10 and 30 retrieved images for the best query P (Best, 10) and

P (Best, 30) for every class in order to see the differences between the class-wise average

and the single best query image. We see that for several classes P (Best, 30) is equal to

1, i.e. that the first 30 images belong to the query’s class and are ranked in decreasing

similarity. P (Best5, 10) and P (Best5, 30) present the average precision of the best five

queries after the first ten and 30 retrieved images, respectively. The numbers show that we

obtain the highest possible score for several classes. Additionally, we report the averaged

recall R(N) and precision values P (N) after N retrieved images, where N is the number of

class members. Finally, R(Best, N) and R(Best, N/2) depict the averaged recall numbers

after N and N/2 retrieved images. We can interpret the values of all measures given in

Table 5.3 as the dynamics of our results. Thus, we agree with the arguments listed in [95]

that only a broad set of performance measures can ensure a thorough evaluation of CBIR

results16. However, we do observe some classes with worse performance. That is to a large extent due to the high intra-class variation of the database. Figure 5.3 shows the strong intra-class variation of several classes. Since CBIR performs a similarity ranking, some class members might be less similar to a certain query (from the same class) than images from other classes, as for example images (d) and (e) in Figure 5.17. Both images show eagles,

but do not belong to the same class. Similar observations hold for many other cases.

In Section 5.4.4 we will discuss the class discrimination ability of our structure-based

features for the same image database. The results will not show a similarity ranking based

on the features of one image, but rather the membership of images to a certain class

(based on SVM learning). Before proceeding with the classification, we want to discuss

the problem of partial matching for watermark images.

5.4.3 Partial Matching

Partial matching is the retrieval of image substructures from an image database. The

image retrieval results for the ancient watermark images have shown that various classes

contain similar filigree or sub-filigree, such as for example the Eagle and Coat of Arms

classes. For these classes it makes sense to investigate partial matching. In Section 4.3 we

have motivated the usage of the intersection measure for determining the image similarity.

In addition to its good performance it supports partial matching. Therefore, we cut out an

16 Note that the analysis of our results largely agrees with the proposed evaluation strategy presented in [95].


arbitrary subregion of a watermark image and compute the same features as for the CBIR

application. Figure 5.15 shows the partial matching result obtained with the intersection

measure. The query image shows a part of an eagle that was cut out from an image of the

Coat of Arms class. The retrieved filigrees mainly contain eagles or parts of eagles that

are similar to the query image.

A second partial matching result is depicted in Figure 5.16, where the query is an eagle’s

head wearing a crown. Several of the retrieved images represent eagles, including heads

and some are wearing a crown. The fourth retrieved image only contains parts of an eagle,

a wing-like structure. It is not obvious why this image is retrieved at rank four. We

might argue that the bins of the wing match with the bins of the feather-like structure in

the query image.

The biggest advantage of the intersection-based partial matching is its linear complexity, resulting in a fast matching time that allows online applications17. However, although the histogram intersection measure produces good partial matching results for watermark images, there might be cases of worse performance (e.g. strongly cluttered scenes of color images). Such cases require other approaches, as for example a sliding-window based histogram matching, where the window is of the query image size. Then it is possible to extract the features from each window for every database image. The matching complexity is at least O(M2) since for every window a matching step has to be performed. In case of overlapping windows the increase in complexity is even higher. Different implementations of window sliding algorithms can be found in [64] and [97].
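A rough sketch of such a sliding-window comparison (Python with NumPy; the aggregation by the best-matching window is our assumption, as the thesis does not specify how window scores are combined):

    import numpy as np

    def best_window_similarity(window_histograms, query_histogram):
        # window_histograms: (n_windows, d) feature histograms, one per
        # query-sized window of a database image; the image is scored by
        # its best-matching window under histogram intersection
        sims = np.minimum(window_histograms, query_histogram[None, :]).sum(axis=1)
        return float(sims.max())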

5.4.4 Classification Results

Previously, we have retrieved similar watermark images. Now we want to learn the fea-

ture distribution of every class in the feature space. Therefore, the classification of the

watermark images is treated as a learning problem. The features of every class are learned

with a support vector machine (see Section 5.3). The classification results are obtained

with leave-one-out tests and SVMs under the usage of different kernels. Specifically, we

have obtained the best results with the intersection kernel and a cost parameter C = 220.

We have used the same features as for the retrieval task. The feature vectors have been

normalized according to zero mean and unit variance (see Section 4.4). In the follow-

ing we present a thorough evaluation of the results with a confusion matrix and various

performance measures.
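The leave-one-out protocol itself can be sketched as follows (again Python with NumPy and scikit-learn, purely as an illustration and not the implementation used in the thesis); it assumes integer class labels 0, ..., n_classes−1 and feature histograms as rows of X:

    import numpy as np
    from sklearn.svm import SVC

    def leave_one_out_confusion(X, y, n_classes, C=1000.0):
        # precomputed histogram intersection Gram matrix over all images
        gram = np.minimum(X[:, None, :], X[None, :, :]).sum(axis=2)
        conf = np.zeros((n_classes, n_classes), dtype=int)
        for i in range(len(y)):
            train = np.delete(np.arange(len(y)), i)
            svm = SVC(kernel="precomputed", C=C)
            svm.fit(gram[np.ix_(train, train)], y[train])
            prediction = svm.predict(gram[i, train][None, :])[0]
            conf[y[i], prediction] += 1   # rows: true class, columns: prediction
        return conf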

Table 5.4 shows the confusion matrix obtained from the leave-one-out classification of

the ancient Watermark database. A closer look at the confusion matrix identifies possible

inter-class misclassification (false positives), such as between classes one and four. Addi-

17Matching times vary between two and four seconds.


Figure 5.15: Partial matching result obtained from the Watermark database with our structural features (see Section 4.2) and under the usage of the intersection similarity measure. The query image resembles a cut of a filigree from the class Coat of Arms.


Figure 5.16: Partial matching result obtained from the Watermark database with our structural features (see Section 4.2) and under the usage of the intersection similarity measure. Note, that the query resembles a head of an eagle that belongs to the class Eagle.


Table 5.4: Confusion matrix for the Watermark database.

Class    1     2     3     4    5    6     7     8    9    10   11   12   13    14   # Members
1       296    0     5    16    1    1     2     0    0     0    0    0    0     1   322
2         3  100     5     0    2    1     2     1    0     0    0    0    1     0   115
3         5    3   121     3    0    1     3     0    0     2    1    0    0     0   139
4        31    0     2    33    0    0     1     1    0     3    0    0    0     0    71
5         0    4     3     1   69    0     6     3    1     0    0    0    0     4    91
6         0    1     0     0    2   34     5     1    1     0    0    0    0     0    44
7         3    2     6     1    9    1   161     4    1     2    1    0    0     6   197
8         2    1     0     0    3    1    10   109    0     0    0    0    0     0   126
9         1    1     1     0    1    1     2     1   91     0    0    0    0     0    99
10        6    0     5     0    0    0     3     1    0    18    0    0    0     0    33
11        1    1     3     0    0    0     1     0    0     0    8    0    0     0    14
12        0    0     0     0    0    0     0     0    0     0    0   31    0     0    31
13        0    1     0     0    0    0     2     0    0     0    0    0   14     0    17
14        0    0     0     0    1    0     1     0    0     0    0    0    0   414   416

tionally, classes seven (Heart) and five (Circle) show a clear tendency of being confused

with each other. We have manually identified the ill-classified watermarks and could reveal

that some of the circle filigrees do contain largely identical substructures.

Table 5.5 shows the class-wise true and false positive cases obtained with a leave-one-out test. We can see that for most of the classes a high recognition rate is achieved. In total, an 87.41% true positive rate could be achieved. However, classes four, ten and eleven show lower recognition rates. A possible explanation was discussed earlier in Subsection

5.4.2. A detailed visual inspection shows that many members of various classes possess

a high inter-class similarity in terms of their real content18. All classes exhibit a large

intra-class variability. If we reassign all eagles from the Coat of Arms class to the Eagle

class19, the true positive rate would exceed 90%.

A single measure is not sufficient to validate the content of a confusion matrix [77].

Influences of an imbalanced class distribution may be reflected by just a single quantity,

whereas multiple measures tend to be more robust and significant in order to guarantee

a concise evaluation of the results [71]. Therefore, in order to assess the classification

results in detail and unbiased, Table 5.6 displays various performance measures. It is easy

18 In Figure 5.17 we show a few representative examples of watermarks with an ambiguous class membership.
19 A visual check of all 31 misclassified Coat of Arms revealed that all of them are featured with a kind of eagle.


Table 5.5: Class-wise true positive (TP) and false positive (FP) rates for the Watermark database, where the first column indicates the correctly classified images and the total number of class members. The second column shows the TP rate in [%]. Column three represents all FP obtained and column four gives the FP rate in [%].

Class   True Positive         False Positive
1       296/322    91.93      52/1393    3.733
2       100/115    86.96      14/1600    0.875
3       121/139    87.05      30/1576    1.904
4        33/71     46.48      21/1644    1.277
5        69/91     75.82      19/1624    1.17
6        34/44     77.27       6/1671    0.3591
7       161/197    81.73      38/1518    2.503
8       109/126    86.51      12/1589    0.7552
9        91/99     91.92       3/1616    0.1856
10       18/33     54.55       7/1682    0.4162
11        8/14     57.14       2/1701    0.1176
12       31/31    100           0/1684   0
13       14/17     82.35       1/1698    0.0589
14      414/416    99.52      11/1299    0.8468

Total:  1499/1715  87.41     216/1715   12.59


(a) Class Heart (b) Class Heart (c) Class Coat of Arms

(d) Class Coat of Arms (e) Class Coat of Arms (f) Class Circle

(g) Class Heart

Figure 5.17: The seven images show examples of watermarks with ambiguities with respect to their ground truth class membership and their real content. The labels below each image show the watermark class. Image 5.17(a) belongs to the class Heart, although there is just a tiny heart in the center of the watermark. In fact, it looks more like a Coat of Arms. A similar argumentation holds for image 5.17(c). Note the embedded eagle. Specifically, 5.17(c) and 5.17(d) were classified as Eagle.


Table 5.6: Detailed performance measures for the Watermark database. The measures are explained in Section 5.2.

Measures [%]:

Class   Accuracy   TN       FN      P        G-Mean1   G-Mean2   F
1       95.45      96.27     8.07   85.06    88.42     94.07     88.36
2       98.31      99.12    13.04   87.72    87.34     92.84     87.34
3       97.20      98.10    12.95   80.13    83.52     92.41     83.45
4       96.56      98.72    53.52   61.11    53.30     67.74     52.80
5       97.61      98.83    24.18   78.41    77.11     86.57     77.09
6       99.07      99.64    22.73   85.00    81.04     87.75     80.95
7       95.69      97.50    18.27   80.90    81.31     89.26     81.31
8       98.31      99.24    13.49   90.08    88.28     92.66     88.26
9       99.36      99.81     8.08   96.81    94.33     95.79     94.30
10      98.72      99.58    45.45   72.00    62.67     73.70     62.07
11      99.53      99.88    42.86   80.00    67.61     75.55     66.67
12      100.00     100.00    0      100.00   100.00    100.00    100.00
13      99.77      99.94    17.65   93.33    87.67     90.72     87.50
14      99.24      99.15     0.48   97.41    98.46     99.34     98.45

W. Average   97.64   98.39   12.59   87.12   87.19   92.45   87.13

to recognize that the weighted averaged accuracy20 for the presented 14-class problem is higher than 97% and that the combined measures G-Mean1 and G-Mean2 score on average at 87.19% and 92.45%, respectively. It is worth noticing that the F-measure is as high as 87.13%, with β = 1.
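As an illustrative sketch (Python with NumPy), the class-wise measures can be derived directly from the confusion matrix counts. The exact definitions are given in Section 5.2, which lies outside this excerpt; the formulas below therefore follow the common conventions of a geometric mean of recall and precision for G-Mean1 and of recall and true negative rate for G-Mean2, and are an assumption:

    import numpy as np

    def class_measures(conf, i):
        # conf: square confusion matrix, true classes as rows, predictions as columns
        conf = np.asarray(conf, dtype=float)
        tp = conf[i, i]
        fn = conf[i, :].sum() - tp
        fp = conf[:, i].sum() - tp
        tn = conf.sum() - tp - fn - fp
        recall = tp / (tp + fn)                     # true positive rate
        precision = tp / (tp + fp)                  # P
        tn_rate = tn / (tn + fp)                    # TN rate
        accuracy = (tp + tn) / conf.sum()
        f_measure = 2 * precision * recall / (precision + recall)   # beta = 1
        g_mean1 = np.sqrt(recall * precision)
        g_mean2 = np.sqrt(recall * tn_rate)
        return accuracy, precision, f_measure, g_mean1, g_mean2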

5.4.5 Conclusion

The retrieval and classification of watermark images is of great importance for paper his-

torians. The results show that structure is a powerful feature for these tasks. The retrieval results have shown that the presented features work quite well. However, for some classes the recall versus number of images plots performed worse. That is due to two factors, as we have shown: firstly, the large visual intra-class variability. Filigrees of the same class often look very different, which causes a low precision in the evaluation. Secondly, some

classes exhibit prominent substructures of others such as for example the classes Eagles

and Coat of Arms. Next, we have performed a classification of the watermark images. A

support vector machine with intersection kernel was able to successfully learn the char-

acteristics of every class. A classification rate (true positive rate) of more than 87% is

an indicator of a good performance. The presentation of several performance measures,

20Defined in Equation 5.10.


derived from the confusion matrix gives a complete evaluation of the results. In future

work, we would like to apply the structural features to a larger database of watermarks.

In addition, we would like to reduce the confusion between the classes Eagles and Coat of

Arms.

5.5 Color Image Retrieval

For the current experiments, we use two different real world color image databases. The

task is to find similar images within the two datasets. The evaluation of the results ob-

tained from the first dataset consists of a detailed class-wise analysis. Due to computational

resource- and time limitations we provide a less comprehensive representation of the ex-

periments performed with the second dataset containing 10.000 images.

The results are evaluated with precision recall graphs and compared with two well

known methods, namely global color histograms and the local invariant feature histogram

algorithm [129]. The color histogram is widely known since the work of [133]. More recent

approaches towards color histograms can be found in [38], [138] and in the survey of [142].

Since color histograms are commonly used in CBIR, they serve well as a comparative

feature. In our experiments we have obtained a 32-bin color-histogram from

the YCbCr chrominance channel of each image. The invariant feature histogram method

is adopted due to its good performance as a local texture-based feature. In [25], the

method was compared along with nine other algorithms and performed best for one of

two image databases used in their experiments. The invariant feature histogram method

based on Haar integrals introduced by [124], has shown good image retrieval results [129].

The method constructs invariant features by applying nonlinear kernel functions to color

images and integration over all possible rotations and translations.

5.5.1 Retrieval Results

For the retrieval task we use the global structure-based features Hglobal as described in

Chapter 4. For the current experiment we are interested in the performance of global

structure features and whether they are suitable to be combined with local descriptors. Therefore,

we combine the global structure features with block-based features (BBF21, [15], [86] and

[87]). The local method (BBF) computes pixel value distributions of equally sized blocks.

Specifically, the method computes three block-based features. The first one consists of the higher and lower mean histograms of the block-based image representation, which contain the pixel value distributions above and below the mean value for the gray-level channel of the YCbCr

21 The block-based features have resulted from a joint work with Prof. Z.M. Lu, Harbin Institute of Technology, China.


(a) Sample image retrieval result for an image out of the class: Mountain.


(b) Sample image retrieval result for an image out of the class: Beach.

Figure 5.18: Sample image retrieval results obtained from the 1.000 image collection. As features we have used block-based features (see Section 5.5.1) and the global structure-based features (see Section 4.2.1). The first image represents the query and the images are arranged in decreasing similarities from left to right, indicated by the numbers above the images. Note, that 1 denotes an identical match with the query image.


(a) Sample retrieval result for an image out of the class: Buses.


(b) Sample retrieval result for an image out of the class: Elephants.

Figure 5.19: Sample retrieval results obtained from the 10.000 image collection. As features we have used block-based features (see Section 5.5.1) and the global structure-based features (see Section 4.2.1). The first image represents the query and the images are arranged in decreasing similarities from left to right, indicated by the numbers above the images, where 1 denotes an identical match with the query image.


color space. The second block-based feature is the binary pattern histogram where binary

blocks of the same size are matched with a pre-generated codebook of binary patterns. The

final block-based feature is a histogram based on the two chrominance channels Cb and Cr. Note that we only use the global structure-based features here. In contrast to the watermark application, we exchange the local structure features Hlocal with the block-based features.

The synthesis of the two features results in a robust semantic image descriptor.
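To make the first block-based feature more concrete, the following rough sketch (Python with NumPy) shows one possible reading of the description above; the actual construction, block size and number of bins follow [15], [86] and [87], so the details here (4 x 4 blocks, 32 bins, value range 0 to 255) are assumptions for illustration only:

    import numpy as np

    def mean_split_histograms(gray, block=4, bins=32):
        # per block, pixel values are split into those above and those below the
        # block mean; the two populations are histogrammed separately
        above, below = [], []
        h, w = gray.shape
        for y in range(0, h - block + 1, block):
            for x in range(0, w - block + 1, block):
                b = gray[y:y + block, x:x + block].astype(float)
                m = b.mean()
                above.extend(b[b >= m].ravel())
                below.extend(b[b < m].ravel())
        hist_hi, _ = np.histogram(above, bins=bins, range=(0.0, 255.0))
        hist_lo, _ = np.histogram(below, bins=bins, range=(0.0, 255.0))
        return hist_hi, hist_lo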

After extracting and normalizing the image feature vectors (see Equation 4.54), we

need to determine which images are the most similar ones to any given query image.

Image matching is performed by comparing feature histograms out of a feature database,

where the feature histogram of each image is computed offline. We only need to compute

the query image’s feature histogram and perform a histogram comparison over the whole

database. For the current application we used the intersection similarity measure (see

Section 4.3).

In our evaluation strategy, we use each image in both databases as the query image

to compute the precision and recall. By averaging the results for each class we obtain a

qualitative performance analysis for all image classes. Before we present a detailed class-

wise analysis of the experiments we show sample results.

We selected some sample queries from both datasets and display their results in

Figure 5.18 and Figure 5.19, respectively. The images in Figure 5.18 were obtained from

the 1.000 image data-set and the results of Figure 5.19 were retrieved from the 10.000

image collection. Each panel displays sample results, where the top-left image is the query

image. All other images are ranked in decreasing similarities, where the number above

each image points out the similarity, with values in the range of [0, 1], with one denoting

an identical match with the query image. The results in Figures 5.18 and 5.19 reveal that

the first eight retrieved images belong to the query image class.

Figure 5.20 shows the average recall versus the number of retrieved images graph for

all ten classes of the 1.000 image database. Each graph in the figure shows five different

curves representing the performance of the global color-histogram, local invariant feature

histogram, global line segment features, the set of block-based features and the combination

of the latter two. It can be seen that the combination the structure and block-based features

performs best for most of the classes, for the first 100 retrieved images. For class Buses the

global line structure feature performs better than the combination with the block-based

feature (BBF). This can be explained with a very prominent geometric structure of the

buses, which is here best encoded by the global line structure features. Thus, the texture

information is of less importance than the geometric arrangements.

We believe that a sophisticated feature selection method might improve the results of

the combined features by applying larger weights to the line segment features. However,

for the current experiments we did not implement feature selection, feature weighting or


[Figure 5.20 consists of ten panels, one per class (Africa, Beach, Buildings, Buses, Dinosaurs, Flowers, Elephants, Horses, Food, Mountains), each plotting recall against the number of retrieved images (up to 400) for the line segment features, the block-based features, their combination, the invariant feature histogram and the color histogram.]

Figure 5.20: Average class-wise recall versus the number of retrieved images graph for the 1.000 image database. All features are plotted for class-wise comparisons. Note that the curves represent averaged quantities, i.e. each class member was taken as a query image and the resulting graphs were averaged.


Figure 5.21: Precision-recall graph for the 1.000 image data-set, where the graph is averaged over all images and classes representing an overall performance measure.


Figure 5.22: Precision-recall graph for the 10.000 image database, where the graph is averaged over all images and classes representing an overall performance measure.


any further relevance feedback methods to improve the results. We rather intended to give

a clear presentation of our features and postpone such improvements to future work.

In the class Horses our features perform worse than the global color-histogram. A pos-

sible reason for that might be found in the color distribution of the image class, i.e. about

80% of the horses in that class are standing or running on green meadows, which introduces a significant background. In terms of pixels, color clearly dominates the information. The fraction of pixels belonging to horses is much smaller than the background

area and thus, color is a stronger feature than structure.

Though the curves of some classes look quite similar to each other, the overall precision-

recall graph in Figure 5.21 proves the superior performance of our features. The graph

represents the averaged precision-recall over all ten classes. The figure shows that the

combination of the local and global features reaches a precision of 100% until a recall

of about 20 % is reached. An interesting observation can be found in the part between a

recall of about 0.25 and 0.35, where the color-histogram performs better than the combined

features. Our investigations revealed that the higher performance of the color-histogram

for the class Horses resulted in a superior recall for this part of the curve.

Figure 5.22 depicts the precision-recall graph averaged over all 25 image categories

of the larger dataset (10.000 images). As for the first data-set, we compute the precision-recall curve for all other features. The combination of the global structure and block-based features performs better than the other methods.

5.5.2 Invariance Analysis

Though the results shown so far give a good idea of the proposed feature’s performance,

we did not give a measure of the robustness or invariance for all methods. It is of great

importance for CBIR applications to be invariant or robust under common image trans-

formations. Very common ones are changes of the illumination, rotation or both simulta-

neously. Hence we conduct an invariance analysis, where we compute all features for each

image for two newly created data-sets. To cope with these kinds of transformations, we

have derived two data-sets from our 1.000 image database. In the first data-set, we increased the brightness of all images by 30% and decreased the saturation by 10%. The second data-set undergoes the same transformation as the first, but with an additional rotation of all images by 90 degrees counter-clockwise.

For the actual invariance test we determine the similarity between each feature of the

database. We take the histogram of each image from the original database and compute the

similarity based on histograms obtained from the transformed image data-set, where the

brightness and saturation of every image was changed. In the second step, we computed the

similarity of each image feature with respect to histograms obtained from the data-set with



Figure 5.23: Robustness analysis of the 1.000 image database for different brightness and rotation conditions (brightness 30 % increased, saturation 10 % decreased). The ordinate shows the degree of averaged invariance for each feature set, with 1 being 100 % invariant.

different brightness, saturation and rotation. As a measure to compare two histograms of the

original and the transformed image we use the L1-Norm, resulting in a value between [0, 1]

with 1 indicating 100% invariance. The results are displayed in Figure 5.23, where we can

see five similarity versus image number plots. The curves show the variation of the features

taken from the original image data and the data-set with a different brightness, saturation

and rotation. A constant line at similarity equal to 1 indicates 100% invariance. From the

figure we can conclude that the line segment feature performs best and the combination

of block-based and line segment features second best and the color histogram the worst.
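The text does not spell out how the L1 distance is mapped to the [0, 1] similarity range; one plausible reading, assuming L1-normalized feature histograms, is sketched below (Python with NumPy, purely for illustration):

    import numpy as np

    def invariance_score(h_original, h_transformed):
        # similarity between the feature histograms of the original and the
        # transformed image; 1 means the features are identical (fully invariant)
        a = h_original / h_original.sum()
        b = h_transformed / h_transformed.sum()
        return 1.0 - 0.5 * np.abs(a - b).sum()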

For a quantitative representation of the feature invariance have a look at Table 5.7, where we present the feature similarity values averaged over all images, for each feature and all databases used, i.e. we provide one number as invariance descriptor. One can see that the global line structure feature performs best on average. The combination of our block-based

and global line structure features is slightly worse than the invariant feature histograms.

The comparison of the original image features with the features of just rotated images is

omitted, since this information can be extracted from Table 5.7. In fact, all features are

equally robust to rotation.


Table 5.7: Averaged feature invariance representation. First column: invariance under different brightness conditions (brightness 30 % increased, saturation 10 % decreased). Second column: invariance under different brightness and rotation. Third column: invariance between different brightness conditions and different brightness plus rotation. LS: line segment feat., BBF: block-based feat., IFH: invariant feature histogram.

Invariance of Features under Image Transformations [%]
Features   Bright.   Rot.-Bright.   Bright. vs. Rot.-Bright.   Average
LS         97.14     96.12          96.48                      96.58
LS+BBF     90.99     89.67          93.92                      91.53
BBF        86.25     86.25          86.25                      86.25
IFH        91.05     90.83          99.49                      93.79
Color      65.65     65.62          99.62                      76.96

5.5.3 Conclusion

We have presented retrieval results for two color image databases. The global structure-

based features have clear invariance properties that underline their usefulness for CBIR

applications. Moreover, we have combined the global structure-based features with block-

based features. The block-based features are derived from equally sized non-overlapping

4 × 4 image blocks. The global structure-based features contain information of relative

spatial arrangements of straight line segments.

Although the global structure features produce good results, the combination of the two

methods improves the performance. The precision recall graphs, averaged over the whole

image database, reveal for both image data-sets that combined features perform better

than the invariant feature histogram method and also outperform the color histogram. It has to be said that for some classes the features perform slightly worse, but averaged over

all classes the precision recall graphs document a better performance.

The feature robustness analysis of the experiments revealed that the global line structure features perform best on average, under changes of the illumination and under changes of the illumination plus rotation, with a mean invariance of the feature histograms of 96.58%.

The invariance of the combination of the global structure features and the local block-based

method is with 91.53% slightly worse than the very robust invariant feature histogram

method with 93.79% invariance.

Although the block-based features, according to the precision-recall graphs, more often perform better than the structure-based features, their combination gains in performance and robustness towards image transformations.

The obtained results encourage further enhancements, i.e. feature selection and/or sophisticated weighting schemes that might improve the feature combination process, lead-


ing to better results.

5.6 Object Class Recognition and Retrieval

In this section we present the results obtained for the Caltech database (see Section 5.1.3).

Specifically, we performed two experiments: the classification of the five image classes

under the usage of the original class labels and content-based image retrieval for all images

of all classes. For the first experiment we report the category confusion matrix and several

performance measures, whereas for the second we present averaged precision recall graphs

and the same measures as in Section 5.4. A critical review of the literature reveals that

occasionally results from subsets are reported. It turns out that a common practice is the

usage of three (sometimes four) classes airplane, faces, (cars,) motorbikes [101]. However,

for our evaluation we use the complete dataset. In order to compare our results with others,

we have additionally performed experiments with just the three classes.

Features

For the subsequent results we have used global and local features as described in Chapter 4.

The feature set {Hglobal, Hlocal} is normalized (see Equation 4.54) and computed offline for

the complete database. Note, that we used the same features for the classification and the

image retrieval task.

5.6.1 Classification Results

In this subsection we report the classification results for the Caltech database obtained

with a multi-class support vector machine and a leave-one-out-test22. The leave-one-out

test was used in order to overcome selection effects of splitting the dataset into training

and test sets. For the classification we use a support vector machine with a histogram

intersection kernel as defined in Equation 5.24. The best result was obtained for a penalty

cost parameter C=221. Note, that we have also used other kernels such as RBF (with different γ), polynomial (of various degrees), sigmoid and linear. Since their classification results are a few percent lower than with the histogram intersection kernel, we decided not to list their outputs. The features have been normalized to zero mean and unit variance

as described in Section 4.4. Table 5.8 contains the class-wise confusion matrix. The last

column shows the number of members for each class. The integers in the table represent

absolute numbers of classified instances. The advantage of a confusion matrix is the easy

identification of true and false positives. Note, the exact true and false positive rates are

listed in Table 5.9. One can observe that the class airplanes depicts the best classification

22N-fold cross validation, with n equal to the number of images in the database.


Table 5.8: Confusion matrix for the Caltech database.

Class     1      2     3     4     5    # Members
1       1036     0     4     3    31    1074
2          2   115     4     4     1     126
3          6     5   411    19     9     450
4         10     3    36   127    10     186
5         30     1    16     7   772     826

Table 5.9: Class-wise true and false positive rates for the Caltech database, where the first column indicates the correctly classified images and the total number of class members. The second column shows the TP rate in [%]. Column three represents all FP obtained and column four gives the FP rate in [%].

Class   True Positive           False Positive
1       1036/1074   96.46      48/1588    3.02
2        115/126    91.27       9/2536    0.355
3        411/450    91.33      60/2212    2.71
4        127/186    68.28      33/2476    1.33
5        772/826    93.46      51/1836    2.76

Total:  2461/2662   92.45     201/2662    7.55

rate of 96.46%. For the class cars we could only identify 9 false positives resulting in a

classification rate of more than 91%. The highest number of false positives, 33, is observed for the class leaves. A detailed look at the confusion matrix in Table 5.8 reveals that leaves were mainly confused with faces. In fact, almost 60% of all false positives are classified as faces, and similarly for the reversed case. Apparently, the features do not

discriminate these two classes very well. A possible explanation can be given by a detailed

visual inspection of both image classes. It can be observed that leaves exhibit only a few

structures and that the used features are not powerful enough to perform as well as for

the other four classes. The classification rate for motorbikes is as high as 93.46%. The

classification rates for faces and cars are comparable.

The class-wise averaged classification rate is 92.45% with a false positive rate of 7.55%.

Occasionally, the literature reports other performance measures such as for example preci-

sion, F-measure or accuracy. Table 5.10 lists the various measures for all five classes. The

classifier achieved an accuracy of 96.57%, a precision of more than 92% and an F-measure of 92.35%, which is comparable with the true positive rate. Note, that the averaged values

are weighted according to the class sizes. For the F-measure we used β = 1, i.e. recall and


Table 5.10: Detailed performance measures for the Caltech database (see Section 5.2).

Measures [%]:

Class   Accuracy   TN      FN      P       G-Mean1   G-Mean2   F
1       96.77      96.98    3.54   95.57   96.02     96.72     96.01
2       99.25      99.65    8.73   92.74   92.00     95.37     92.00
3       96.28      97.29    8.67   87.26   89.27     94.26     89.25
4       96.54      98.67   31.72   79.38   73.62     82.08     73.41
5       96.06      97.22    6.54   93.80   93.63     95.32     93.63

W. Average   96.57   97.35   7.55   92.35   92.3   94.78   92.35

precision are equally weighted.23

In order to compare our results with state-of-the-art algorithms, Table 5.11 lists classification rates of several authors. For a fair comparison we

only list results obtained from the full set of images. Our classification rates are given for

multi-class experiments obtained from complete leave-one-out tests. Moreover, most results

have been only reported on class-wise classification methodologies, i.e. a classifier was

trained in order to discriminate a single class among the four listed ones in Table 5.11 from

a background class24. Only in [80] the authors additionally reported the class separation

performance for all four classes in form of a confusion table. It is obvious that the latter approach is more challenging than a pure one-class problem. The results of [80] confirm that, for the class motorbikes, the classification rate dropped by about 3% between the

one-class and multi-class problem. This result suggests that the inter-category separation

is of a higher difficulty, but also gives a further insight into the discrimination ability of

a feature. However, Table 5.11 shows that our approach outperforms most of the other

Table 5.11: Comparison of our structure-based method with others from the literature.

Classes         Our     [32]    [80] one-vs-rest   [80]      [101]
Airplanes       96.55   90.2    93.7               95.4      n.r.a
Faces           96.22   96.4    94.4               93.4      n.r.
Motorbikes      93.58   92.5    96.1               93.1      n.r.
Aver. Class.b   95.45   93.0    94.68              94.21c    94.4

a Single class results were not reported.
b The average weighted classification performance is computed with respect to the same classes in order to guarantee a fair comparison.
c The authors report an overall classification rate of 93.46% for four classes. For a fair comparison we only consider the three classes listed in the current table.

23 The F-measure can be interpreted as a harmonic mean.
24 The background class consists of arbitrary images.


methods. Notice that we repeated the classification for the very same image classes in order

to guarantee a fair comparison. For the class faces our approach is slightly less performant

than the one in [32]. For the class motorbikes [80] reports a higher classification rate

for the one-class approach. However, for the class separation task the performance drops

below ours. The overall classification rate of our method is the highest with more than

95%. Note that the classification rate of our features increased by 3% in comparison to

the five class problem (see Table 5.9).

5.6.2 Retrieval Results

In this subsection we present the CBIR results for the Caltech database. Now we will find

a ranking of similar images without the use of a training set25. In fact, the similarity

measure has to determine the degree of resemblance for each (feature) vector with the query

vector in the high-dimensional feature space R^n. For the current experiment we

used the histogram intersection measure, as motivated in Section 4.3.
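For illustration, a minimal sketch of histogram intersection ranking (generic NumPy code, not the thesis implementation; the feature vectors are assumed to be L1-normalized histograms, so that the score lies in [0, 1] and equals 1 for an identical match):

    import numpy as np

    # Generic sketch of histogram intersection ranking (not the original code).
    def intersection_score(h_query, h_image):
        return np.minimum(h_query, h_image).sum()

    def rank_database(h_query, database):
        # database: array of shape (num_images, num_bins); returns the indices
        # sorted by decreasing similarity, as displayed in Figures 5.24 and 5.25.
        scores = np.minimum(database, h_query).sum(axis=1)
        order = np.argsort(-scores)
        return order, scores[order]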

Figure 5.24 shows a sample retrieval result for the class airplanes. The first image

retrieved is the query image and the other images displayed belong to the same class.

The numbers above each image denote the degree of similarity with the query, where 1

indicates an identical match. A more detailed look reveals that not only images from the

same class are retrieved but all airplanes appear in a quite similar environment, which is

due to the structure-based features encoding the whole scene (e.g. there are no airplanes

in the sky among the first retrieved samples). In Figure 5.25 we can see a sample retrieval

for the class of motorbikes. The query image is found as the first match. The images are

sorted in decreasing similarity with respect to the query denoted by the numbers above

each image. The structure-based features capture the geometric appearance such that the

retrieved motorbikes exhibit a similar placement and a fairly similar type.

Sample retrievals are a good means for a visual inspection but lack an overall objectivity.

Therefore, it is absolutely necessary to report performance measures such as precision-recall

graphs and other measures for an objective evaluation. Figure 5.26 displays five class-wise

averaged precision-recall graphs. The class airplanes performs very well; on average slightly

less than 40% of all relevant images are retrieved at precision 1. That means that, on average,

for an airplane query each of the first 380 retrieved images belongs to the same class. The

graph for the class cars looks a bit different. In fact, the precision drops faster from the

value one but decreases more smoothly with increasing recall than for airplanes. The worst

performance is achieved for the class leaves26, which is particularly difficult.

The leaves fill a rather small area of each image, so background information is emphasized.

In addition, leaves do not contain a lot of structural information and our features are not

25 Note that we take the same features as for the classification task.
26 Note that the class leaves is almost never used in the literature.



Figure 5.24: Result obtained with structure features (see Section 4.2), from the Caltech database.



Figure 5.25: Result obtained with structure features (see Section 4.2), from the Caltech database.


[Figure 5.26 consists of five precision-recall panels - Airplane, Cars (rear), Faces, Leaves and Motorbikes - each plotting precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1).]

Figure 5.26: Class-wise averaged precision-recall graphs for the Caltech image database.

as discriminative as for the other four classes. The combination with a color histogram

might improve the performance for the class leaves. The precision-recall graphs for faces

and motorbikes evidence a good retrieval performance.
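As an illustration of how such graphs can be computed (a generic sketch, not the original evaluation code), precision and recall are obtained from the ranked list returned for a query and then averaged over all queries of a class:

    import numpy as np

    # Generic sketch: precision and recall after each retrieved image, computed
    # from the ranked class labels of a single query.
    def precision_recall(ranked_labels, query_label):
        relevant = (np.asarray(ranked_labels) == query_label).astype(float)
        hits = np.cumsum(relevant)                  # relevant images seen so far
        ranks = np.arange(1, len(relevant) + 1)     # number of images retrieved
        return hits / ranks, hits / relevant.sum()  # precision, recall

Class-wise curves such as those in Figure 5.26 can then be obtained by interpolating the per-query curves at common recall levels before averaging.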

Invariance

In the previous sections we have shown the appropriateness of our structural features for

object class retrieval. The precision-recall graphs have shown a good performance for the

five classes. Now we want to investigate the invariance properties of the features. In Section

5.5.2 we have already shown the invariance properties of our features. The comparison

with other methods has proven a high degree of invariance for the structure-based features.

Now, we want to investigate the invariance or robustness against seven non-linear image


transformations. Therefore, we take an image out of the class Motorbikes and apply seven

transformations to it: In detail, we perform a Gaussian blurring, add noise, add sparkle light

effects, flip the image, perform one affine and two projective transformations. The newly

created images serve as queries for the complete Caltech database. We are interested in the

retrieval performance and in the retrieval rank of the original unchanged image. Figures

5.27 and 5.28 show the eight retrieval results. The figures show that the transformed query

images have strongly changed their visual appearance. The result in Figure 5.27(a) is

obtained from the original image. The first retrieved image is identical with the query, as

indicated by the intersection similarity measure of 1. The unaltered motorbike is retrieved

first for the case of Gaussian blurring (see Figure 5.27(b)), although the similarity

measure has decreased from 1 to 0.823. Similarly, the original image exhibits rank 1 for

the sparkle light effect (see Figure 5.27(d)), the flipped image (see Figure 5.28(a)), the

affine transformation (see Figure 5.28(b)) and for the second projective transformation

(see Figure 5.28(d)). For the motorbike image with added noise, the original image is retrieved

at position two. For the first projective transformation the original motorbike is retrieved

at rank three. Note that the second projective transformation changes the original image

extremely.
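To indicate how such query variants can be generated, the following sketch uses generic image-processing tools (NumPy and SciPy); it is an illustration only, not the procedure used to produce Figures 5.27 and 5.28, and the sparkle light and projective variants are omitted for brevity:

    import numpy as np
    from scipy import ndimage

    # Illustrative sketch: create transformed variants of a grey-value query
    # image given as a 2D float array with values in [0, 255].
    def make_transformed_queries(img):
        variants = {}
        variants['blur'] = ndimage.gaussian_filter(img, sigma=2.0)   # Gaussian blurring
        variants['noise'] = np.clip(img + np.random.normal(0.0, 15.0, img.shape), 0, 255)
        variants['flip'] = np.fliplr(img)                            # horizontal image flip
        # mild affine transformation (shear); the matrix maps output to input coordinates
        shear = np.array([[1.0, 0.15],
                          [0.05, 1.0]])
        variants['affine'] = ndimage.affine_transform(img, shear)
        return variants

Each variant is passed through the same feature extraction and then used as a query against the complete database.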

The results show that the structure-based features are mostly robust against the seven

described transformations. Figure 5.29 shows seven precision-recall graphs, one for each of

the transformed query images. The seven graphs are quite similar, except the graph for

the first projective transformation, which is significantly worse than the others. The results

show that the similarity score decreases for all transformations. Nonetheless, the retrieval

rank of the original image remains 1 for most of the transformed image queries. Although

the results are very promising, further investigations are necessary in order to make a

general statement about the features' robustness under the seven discussed transformations.

5.6.3 Conclusion

In this section we have presented results obtained from classification and retrieval exper-

iments performed with the Caltech image collection. For the classification we have used

a multi-class SVM with a histogram intersection kernel. The results are very competitive

as shown by a comparison with state of the art algorithms. In fact, we have obtained

a classification rate of 92.45% for the five class problem and a 95.45% rate for the three

class problem. For the second experiment we have presented averaged precision-recall

graphs and various performance measures. The graphs show that our structure-based fea-

tures perform well for object categorization and retrieval tasks. We have completed the

experiment with an investigation of the robustness/invariance under seven non-linear im-

age transformations. The results show that for most transformations the retrieval rank of

the unaltered image remains 1. In future work we are interested in performing a thorough


[Figure 5.27 consists of four panels - (a) Original, (b) Gaussian blur, (c) Random noise, (d) Sparkle light effects - each showing the transformed query and its highest-ranked matches with their similarity scores.]

Figure 5.27: Result obtained with structural features (see Section 4.2) for some transformations.


[Figure 5.28 consists of four panels - (a) Image flip, (b) Affine, (c) Projective1, (d) Projective2 - each showing the transformed query and its highest-ranked matches with their similarity scores.]

Figure 5.28: Result obtained with structural features (see Section 4.2) for some transformations.


[Figure 5.29 plots precision (y-axis) against recall (x-axis) for eight curves: Original, Blur, Noise, Sparkle, Flip, Affine, Projective1 and Projective2.]

Figure 5.29: Precision-recall graphs for several image transformations of the query image of Figures 5.27 and 5.28, which belongs to the class Motorbikes.


Table 5.12: Confusion matrix for the Brodatz database.

Class    1   2   3   4   5   6   7   8   9  10  11  12  13   # Members
  1     94   0   1   1   0   0   0   0   0   0   0   0   0      96
  2      0  96   0   0   0   0   0   0   0   0   0   0   0      96
  3      1   0  95   0   0   0   0   0   0   0   0   0   0      96
  4      0   0   0  95   1   0   0   0   0   0   0   0   0      96
  5      0   0   0   2  94   0   0   0   0   0   0   0   0      96
  6      0   0   0   0   0  85   0   1   0   0   0   0  10      96
  7      0   0   1   0   0   0  95   0   0   0   0   0   0      96
  8      0   0   0   0   0   0   0  95   0   0   0   0   1      96
  9      0   0   0   0   1   0   0   0  95   0   0   0   0      96
 10      0   0   0   0   0   0   0   0   0  96   0   0   0      96
 11      0   0   0   0   0   0   0   0   0   0  96   0   0      96
 12      0   0   0   0   0   0   0   0   0   1   0  95   0      96
 13      0   0   0   0   0   4   0   0   0   0   0   0  92      96

invariance analysis for the seven transformations, i.e. compute averaged precision-recall

graphs for every image in the Caltech database under the seven image transformations.

5.7 Texture Class Recognition

In this section we demonstrate the performance of our structure-based features for a tex-

ture classification problem. Texture analysis and classification is a widely studied field

of research and a tremendous number of different approaches have been developed so far.

However, in the following we restrict the literature review to edge-based methods. Though

edge-based approaches have not been as widely used as statistical, Gabor filter or wavelet-

based methods, several studies [103], [70], [134], [81] have demonstrated their good classification

performance.

We use the global and local features Hglobal and Hlocal described in Chapter 4. The

feature vectors are normalized to zero mean and unit variance. In the current experiment

we apply a support vector machine with an intersection kernel, a cost parameter of C = 220

and a leave-one-out cross validation. That is, for each of the n patterns the classifier is

trained on the remaining n - 1 patterns and tested on the held-out one; the average over all n runs gives the classification score.
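The following sketch outlines this evaluation protocol with a generic SVM library (scikit-learn); it is illustrative only and not the implementation used for the reported results:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    # Illustrative sketch: SVM with a histogram intersection kernel, evaluated
    # with a leave-one-out test.
    def intersection_kernel(A, B):
        # Gram matrix of K(x, z) = sum_i min(x_i, z_i) between all rows of A and B.
        return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

    def loo_classification_rate(X, y, C=220):       # cost parameter as given above
        clf = SVC(C=C, kernel=intersection_kernel)
        scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
        return scores.mean()                        # fraction of correctly classified patterns

    # X: (n_patterns, n_bins) normalized feature matrix, y: the 13 texture class labels
    # rate = loo_classification_rate(X, y)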

Table 5.12 shows the confusion table for the 13 classes of the Brodatz database and in

Table 5.13 the corresponding true positive and false positive rates are printed. Note that

each class consists of 96 members. For classes 2, 10 and 11 (Brick, Water and Weave) a

100% classification rate is obtained. For classes 3, 4, 7, 8, 9 and 12 (Bubbles, Grass, Raffia,

Sand, Straw and Wood) only one image was falsely classified, which corresponds to a true positive


Table 5.13: True positive and false positive rates for the Brodatz database. The first column indicates the correctly classified images and the total number of class members, the second column shows the TP rate in [%], column three represents all FP obtained and column four gives the FP rate in [%].

Classes   True Positive    [%]       False Positive    [%]
 1         94/96            97.92     1/1152           0.08681
 2         96/96           100        0/1152           0
 3         95/96            98.96     2/1152           0.1736
 4         95/96            98.96     3/1152           0.2604
 5         94/96            97.92     2/1152           0.1736
 6         85/96            88.54     4/1152           0.3472
 7         95/96            98.96     0/1152           0
 8         95/96            98.96     1/1152           0.08681
 9         95/96            98.96     0/1152           0
10         96/96           100        1/1152           0.08681
11         96/96           100        0/1152           0
12         95/96            98.96     0/1152           0
13         92/96            95.83    11/1152           0.9549

Total:   1223/1248          98.00    25/1248           2.003

rate of 98.96%. Classes 6 and 13 (Pigskin and Wool) performed the worst. A closer look at

the confusion matrix reveals that both classes are confused with each other. The textures

appear to be quite similar in terms of their edge and line-segment representation. These

two classes account for almost 60% of all false positives of the data set. Nonetheless, the

overall averaged classification rate for our method is 98.0%, which can be regarded as very

competitive. In Table 5.14 we list additional performance measures such as for example

accuracy, precision and F-measure.

In order to provide a qualitative comparison with other approaches we list in Table 5.15

the class-wise classification rates, if available, for the work of [44] and [100]. A discussion

of the table reveals that although the method of [44] scores 100% for five classes, its

overall performance is lower than ours. The authors in [100] have reported the overall

performance for two sets of features. However, the first number of 97.52% reports the

best result for their rotation invariant variance (VAR) operator (128 bins) that describes

the contrast of local image texture. A second feature, called the local binary pattern

(LBP) operator was introduced in order to extract a rotation and gray scale invariant

feature. In detail, the local neighborhood is binary thresholded at the gray scale level of

the center pixel. The best LBP run resulted in a correct classification of 96.88%. Finally,

the combination of both features improved their classification rate to 99.52%.


Table 5.14: Detailed performance measures (see Section 5.2.1), for the Brodatz database.

All measures in [%].

Classes      Accuracy   TN        FN      P        G-Mean1   G-Mean2   F
 1           99.76      99.91      2.08    98.95    98.43     98.91     98.43
 2          100.00     100.00      0      100.00   100.00    100.00    100.00
 3           99.76      99.83      1.04    97.94    98.45     99.39     98.45
 4           99.68      99.74      1.04    96.94    97.94     99.35     97.94
 5           99.68      99.83      2.08    97.92    97.92     98.87     97.92
 6           98.80      99.65     11.46    95.51    91.96     93.93     91.89
 7           99.92     100.00      1.04   100.00    99.48     99.48     99.48
 8           99.84      99.91      1.04    98.96    98.96     99.43     98.96
 9           99.92     100.00      1.04   100.00    99.48     99.48     99.48
10           99.92      99.91      0       98.97    99.48     99.96     99.48
11          100.00     100.00      0      100.00   100.00    100.00    100.00
12           99.92     100.00      1.04   100.00    99.48     99.48     99.48
13           98.80      99.05      4.17    89.32    92.52     97.43     92.46

W. Average   99.69      99.83      2.00    98.04    98.01     98.90     98.00

5.7.1 Conclusion

In the previous section we have presented results for the classification of Brodatz textures.

The structure-based method has obtained very good results that are similar to state of

the art methods. An average classification rate of 98% is reached. The result indicates

that straight line segments and their connectivity are useful for texture classification. We

believe that our method is a good alternative to Gabor and wavelet based methods for

texture recognition tasks. In the future, we are interested in combining Gabor features with

our structural features for similar applications.


Table 5.15: Comparison of the structure-based method with others from the literature for the Brodatz database.

Classes     Our      [44]     [100]
 1          97.92    87.5     n.r.
 2         100      100       n.r.
 3          98.96   100       n.r.
 4          98.96    95.8     n.r.
 5          97.92    93.8     n.r.
 6          88.54    95.8     n.r.
 7          98.96   100       n.r.
 8          98.96    97.9     n.r.
 9          98.96   100       n.r.
10         100       97.9     n.r.
11         100      100       n.r.
12          98.96    97.9     n.r.
13          95.83    91.7     n.r.

Average:    98.00    96.8     97.52 / 99.52


Chapter 6

Conclusions and Perspectives

6.1 Conclusions

In this thesis we have investigated content-based image retrieval and classification prob-

lems. For this purpose we have developed a structure-based feature extraction method that

encodes relative spatial arrangements of line segments. The method is capable of repre-

senting the global structure of an image, as well as the local structure of perceptual groups

and their connectivity. The advantage of the method is the invariance towards changes

in illumination, similarity transformations (rotation, translation and isotropic scaling) and

slight changes of the viewing angle. The results show that structure is a prominent and

highly discriminative feature.

In order to verify the correct extraction of structural information we have evaluated

various edge detectors. Next, we have presented a method that automatically computes

the best set of Canny parameters for real world images. We have evaluated 550 different

parameter sets and determined the best one by comparison with a manually generated

ground truth. The best parameter set produces an error rate of only 0.1 to 1.3%. The

result shows that it is possible to restrict the range of the three Canny parameters to a

meaningful subrange.

In addition, we have used the edge maps to extract straight line segments with an edge

point tracking algorithm that was compared with the standard Hough transformation. The

result has shown that the standard Hough transformation could not robustly produce line

segments for our set of images. Though the Hough transform gives in general good results,

it is prone to misleading or false results for aligned objects. On the other hand, the edge

point tracking algorithm produces high quality maps of line segments that are very robust

to changes in illumination.

Next we have applied a straight line segment grouping method based on agglomera-

tive hierarchical clustering that automatically discards less important segments. We have


firstly proven the existence of an underlying clustering structure in the feature space with

the Hopkins test. A result of h = 0.81 suggests the adequacy of our clustering method

with a confidence level of more than 90%. Secondly, we have evaluated the best linkage

method for the agglomerative hierarchical clustering by computing the cophenetic correla-

tion coefficient for 15972 different hierarchies. With an average score of 0.9262 the average

linkage method produces the best result. Thirdly, we have introduced a subgraph distance

ratio that is used in order to prune or cut a dendrogram. The resulting groups of straight

line segments are divided into salient and less important clusters on the basis of the intra-

class compactness. Subsequently, we have used the salient subset of line segments for the

computation of the invariant structural features.

The features have been applied to various content-based image retrieval and classifica-

tion problems. The first application was the classification and content-based image retrieval

of watermark images, where the features have proven to be powerful descriptors. We

achieved a classification rate of more than 87% for the 14-class watermark problem,

although the database features a high intra-class variability. The precision-recall graphs

have shown good results for various classes. Some classes performed worse due to the high

intra-class variation. Therefore, we have presented several performance measures that show

the dynamics of the retrieval results. In addition, we have performed a partial matching for

various sample filigrees, which produced promising results. The partial matching turned out

to be computationally very expensive. Further investigation and algorithmic acceleration

is needed in order to make it more applicable.

The second application was a retrieval task of two color image databases from the Corel

collection with 1,000 and 10,000 images. We have shown that the global structure features

produce quite good results and could be improved with local block-based features. The

combination of the local and global information performed better for most classes, except

for the class Buses. That is because buses contain to a large extent linear structures which

favor the line segment features. In addition, the combination performed better than two

standard methods for both data sets. We observed only one class (Horses),

from the 1,000 image set, where the color histogram clearly outperformed our features. A

visual inspection of that class revealed a clear domination of the green color for the Horses

images.

Since invariance plays a very important role for CBIR, we have evaluated the perfor-

mance of all methods under several image transformations. The analysis has shown that

the structure-based features are the most invariant against changes in illumination and

rotation, with an average score of more than 96%.

The third application was object class recognition and retrieval for the Caltech database.

We have used support vector machines to learn the feature space distribution of our

structure-based features for several image classes. For the classification we have used


a multi-class SVM with a histogram intersection kernel. The results have been obtained

with a leave-one-out test. The results are competitive as shown by a comparison with

state of the art algorithms. In fact, we have obtained a 92.45% classification rate for the

five class problem and a 95.45% rate for the three class problem. The invariance analysis

showed a good robustness towards several image transformations.

In addition we have performed an image retrieval task for the same database. The

results were presented with averaged precision recall graphs and various performance mea-

sures. The results show that our structure-based features can be used for learning as well

as for image similarity tasks.

The fourth and final application was the classification of textures obtained from the

well-known Brodatz collection. We have applied our line segment features to the 13-class

problem, which was evaluated with an SVM and a leave-one-out test. The intersection kernel

produced a very competitive classification rate of 98%.

We have shown the broad applicability of structure-based image features for classifi-

cation and content-based image retrieval tasks. The four presented applications comprise

tasks as broad as binary, color, object class and texture image retrieval and/or classifi-

cation. The features have proven invariant and of high quality in the comparison with

state-of-the-art methods.

6.2 Outlook and Perspectives

In this thesis we have shown that structure can effectively be used in order to solve content-

based image retrieval and classification tasks. Although we have presented four very

different applications, structure features can possibly be applied to various other tasks such

as for example satellite imagery, medical image analysis and retrieval or to astronomical

data repositories (e.g. the virtual observatory).

We are also interested in pursuing our results on the automatic estimation of the best

parameters for edge detectors. We will enhance the method with respect to salient ground

truth maps. Currently, we consider each correctly or falsely detected pixel equally impor-

tant, though it is easy to imagine that some edges are more salient than others.

Hierarchical clustering is another point of interest to us. In this thesis we have applied

AHC to a set of line segments, although in general the method can be applied to arbi-

trary feature vectors. This generality motivates us for further experiments on clustering

features. The pruning method and cluster selection strategy presented in this thesis have

been applied to a set of line segments. However, we are interested in the performance for

other data patterns such as for example feature vectors. For arbitrary features, the cluster

selection process might be understood as the selection of important or salient features.

Thus, the method might contribute to the problem of feature selection.


The structure-based features show good discrimination abilities, so far. We are inter-

ested whether a formation of larger perceptual groups would improve the results or not.

Currently, we consider local groups of line segments that fulfill certain relative spatial

arrangements such as proximity, parallelism, perpendicularity, similar lengths or angles. In

future work we might develop an even more structured hierarchy of perceptual groups.

Moreover, we have shown partial matching results. Unfortunately, partial matching is

computationally very expensive. Therefore, we are interested in accelerating the matching

process and/or investigating other, faster matching methods (Earth mover's distance [119],

Hausdorff distance).

The experiments on the evaluation of the best similarity measure inspired us to conduct

further research. Specifically, we are interested in whether our observation that the intersection mea-

sure performs best also holds for arbitrary feature sets. We believe that there is too little

literature on the behavior of similarity measures for different feature space distributions.

The good classification results presented in this thesis encourage us to use our structure-

based features for relevance feedback applications. We expect a significant improvement

of the performance.


Appendix A

A.1 Mercer Theorem

Theorem 1 (Mercer's Theorem). Let Ω ⊂ R^n. Assume that K(x, z), x, z ∈ Ω, is a continuous and symmetric function and that there exists a mapping Φ from R^n to H, where H is a Hilbert space,

\[
\Phi : x \mapsto \Phi(x) \in H.
\]

Then there is an equivalent representation of the inner product operation

\[
\sum_{r} \Phi_r(x)\, \Phi_r(z) = K(x, z), \tag{1}
\]

where K(x, z) is a symmetric function, the so-called kernel function, satisfying

\[
\int_{\Omega \times \Omega} K(x, z)\, g(x)\, g(z)\, dx\, dz \;\ge\; 0, \qquad \forall\, g \in \Phi(\Omega), \tag{2}
\]

for any g(x), x ∈ R^n, such that

\[
\int g(x)^2\, dx < \infty. \tag{3}
\]
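As an illustrative remark added here (not part of the original appendix), the histogram intersection kernel employed for the SVM experiments in Chapter 5 is commonly written as

\[
K(x, z) = \sum_{i=1}^{n} \min(x_i, z_i),
\]

which is known to satisfy the Mercer condition (2) for non-negative histogram vectors, so that it can be employed as a kernel function in a support vector machine.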


Bibliography

[1] S. Aksoy and R. M. Haralick. Probabilistic vs. geometric similarity measures for

image retrieval. In Conference on Computer Vision and Pattern Recognition (CVPR

2000), 13-15 June 2000, SC, USA, pages 2357–2362, 2000.

[2] S. Aksoy and R. M. Haralick. Feature normalization and likelihood-based similarity

measures for image retrieval. Pattern Recognition Letters, 22(5):563–582, April 2001.

[3] Y. Avrithis, Y. Xirouxakis, and S. Kollias. Affine-invariant curve normalization

for object shape representation, classification and retrieval. Machine Vision and

Applications, 13, Issue 2:80–94, 2001.

[4] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape

matching and object recognition. In NIPS, pages 831–837, 2000.

[5] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue

Software, 2002.

[6] A. Bhattacharyya. On a measure of divergence between two statistical populations

defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–110, 1943.

[7] H. H. Bock. On some significance tests in cluster analysis. Journal of Classification,

2:77–108, 1985.

[8] K.W. Bowyer, C. Kranenburg, and S. Dougherty. Edge detector evaluation using

empirical roc curves. Computer Vision and Image Processing (CVIU), 84(1):77–103,

October 2001.

[9] J. N. Breckenridge. Validating cluster analysis: Consistent replication and symmetry.

Multivariate Behavioral Research, 35:261–285, 2000.

[10] C. M. Briquet. Les filigranes, Dictionnaire historique des marques de papier dès leur

apparition vers 1282 jusqu'en 1600, Tome I à IV, Deuxième édition. Verlag Von

Karl W. Hiersemann, Leipzig, 1923.


[11] P. Brodatz. Textures – A Photographic Album for Artists and Designers. Dover

Publications, New York, 1966.

[12] R. Brunelli. Histogram analysis for image retrieval. Pattern Recognition, 34, 2001.

[13] G. Brunner and H. Burkhardt. Building classification of terrestrial images by generic

geometric hierarchical cluster analysis features. In IAPR Workshop on Machine

Vision Applications (MVA2005), pages 136–139, Tsukuba Science City, Japan, Apr.

2005.

[14] G. Brunner and H. Burkhardt. Structure features for content-based image retrieval.

In Pattern Recognition - Proc. of the 27th DAGM Symposium, Vienna, Austria, pages

425–433. Springer, Berlin, Aug. 2005.

[15] G. Brunner and Z.-M. Lu. Block-based and structure-based features for content-based

image retrieval. In preparation, 2006.

[16] C. Burges. A tutorial on support vector machines for pattern recognition. Data

Mining and Knowledge Discovery, 2(2):121–167, 1998.

[17] J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting straight lines. IEEE

Trans. on Pattern Analysis and Machine Intelligence, 1986.

[18] J. Canny. A computational approach to edge detection. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 8:679–698, 1986.

[19] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik. Blobworld: A

system for region-based image indexing and retrieval. In Third International Con-

ference on Visual Information Systems. Springer, 1999.

[20] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.

Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[21] O. Chapelle, P. Haffner, and V. Vapnik. SVMs for histogram-based image classifica-

tion. IEEE Transaction on Neural Networks, 10(5):1055–1064, 1999.

[22] Corel Inc. Corel’s 1000 images database.

[23] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons,

1991.

[24] M. H. DeGroot. Probability and Statistics 3rd ed. Reading, MA: Addison-Wesley,

1991.


[25] T. Deselaers, D. Keysers, and H. Ney. Features for image retrieval: A quantitative

comparison. In C. E. Rasmussen, H. H. Bulthoff, M. A. Giese, and B. Scholkopf,

editors, Pattern Recognition - Proc. of the 26th DAGM Symposium, LNCS 3175,

Tubingen, Germany, pages 228–236. Springer, Berlin, Aug. 2004.

[26] M. Do and M. Vetterli. Wavelet-based texture retrieval using generalized gaussian

density and kullback-leibler distance. IEEE Trans.on Image Proc., 146-158 2002.

[27] A. Dorado and E. Izquierdo. Semi-automatic image annotation using frequent key-

word mining. In Proceedings of the Seventh International Conference on Information

Visualization, volume 00, pages 532–535, Los Alamitos, CA, USA, 2003. IEEE Com-

puter Society.

[28] J. Eakins, A. Jean, E. Brown, J. Riley, and R. Mulholland. Evaluating a shape

retrieval system for watermark images. In A. Bentkowska, T. Cashen, and J. Sun-

derland, editors, CHArt Conference Proceedings, volume four; Digital Art History -

A Subject in Transition: Opportunities and Problems, British Academy in London

on 28th and 29th November 2001, 2002.

[29] P. Enser and C. Sandom. Towards a comprehensive survey of the semantic gap in

visual image retrieval. In CIVR 2003 - International Conference on Image. and Video

Retrieval, Urbana, IL, USA, 24-25 July, LNCS 2728, pages 291–299, 2003.

[30] F. J. Estrada and A. D. Jepson. Perceptual grouping for contour extraction. In 17th

International Conference on Pattern Recognition (ICPR), volume 02, pages 32–35,

Los Alamitos, CA, USA, 2004. IEEE Computer Society.

[31] B. S. Everitt. Cluster Analysis 2nd Edition. Heineman Educational Books Ltd., 2nd

Edition, London, 1980.

[32] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised

scale-invariant learning. In IEEE Computer Society Conference on Computer Vision

and Pattern Recognition (CVPR), volume 02, page 264, Los Alamitos, CA, USA,

2003. IEEE Computer Society.

[33] A. Fischer, T. H. Kolbe, F. Lang, A. B. Cremers, W. Forstner, L. Plumer, and

V. Steinhage. Extracting buildings from aerial images using hierarchical aggregation

in 2D and 3D. Computer Vision and Image Understanding: CVIU, 72(2):185–203,

1998.

[34] T. Frese, C. Bouman, and J. Allebach. A methodology for designing image similarity

metrics based on human visual system models. Technical report, Tech. Rep. TR-ECE

97-2, Purdue University, West Lafayette, 1997.


[35] C. Galambos, J. Kittler, and J. Matas. Progressive probabilistic hough transform

for line detection. In IEEE Computer Society Conference on Computer Vision and

Pattern Recognition (CVPR’99), volume 01, page 1554, Los Alamitos, CA, USA,

1999. IEEE Computer Society.

[36] Y. Gao and M. K. H. Leung. Line segment hausdorff distance on face matching.

Pattern Recognition, 35(2):361–371, 2002.

[37] M. Gerke, C. Heipke, and B.-M. Straub. Building extraction from aerial imagery

using a generic scene model and invariant geometric moments. In Proceedings of

the IEEE/ISPRS joint Workshop on Remote Sensing and Data Fusion over Urban

Areas, University of Pavia, Rome (Italy), pages 85–89, Nov. 2001.

[38] T. Gevers and A. Smeulders. Color based object recognition. In ICIAP (1), pages

319–326, 1997.

[39] A. D. Gordon. Clustering algorithms and cluster validity. In P. Dischdedt, R. Os-

termann (Eds.), Computational Statistics: Papers Collected on the Occasion of the

25th Conference on Statistical Computing, pages 497–512. Physica-Verlag, Heidel-

berg, 1994.

[40] K. C. Gowda. Cluster detection in a collection of collinear line segments. Pattern

Recognition, 17(2):221–237, 1984.

[41] A. Graf, A. Smola, and S. Borer. Classification in a normalized feature space using

support vector machines. IEEE Transactions on Neural Networks, 14(3):597–605,

2003.

[42] D. S. Guru, B. H. Shekar, and P. Nagabhushan. A simple and robust line detection

algorithm based on small eigenvalue analysis. Pattern Recognition Letters, 25(1):1–

13, January 2004.

[43] A. Halawani and H. Burkhardt. Image Retrieval by Local Evaluation of Nonlinear

Kernel Functions around Salient Points. In Proceedings of the 17th International

Conference on Pattern Recognition (ICPR-2004), Cambridge, United Kingdom, Aug.

2004.

[44] G. M. Haley and B. S. Manjunath. Rotation-invariant texture classification us-

ing a complete space-frequency model. IEEE Transactions on Image Processing, 8,

no.2:255–69, Feb. 1999.


[45] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering algorithms and validity

measures. In Proceedings of the 13th International Conference on Scientific and Sta-

tistical Database Management, July 18-20, 2001, George Mason University, Fairfax,

Virginia, USA, pages 3–22, 2001.

[46] J. A. Hartigan. Representation of similarity matrices by trees. Journal of the Amer-

ican Statistical Association, 62:1140–1158, 1967.

[47] M. D. Heath, S. Sarkar, T. Sanocki, and K. W. Bowyer. A robust visual method for

assessing the relative performance of edge detection algorithms. IEEE Trans. Pattern

Analysis and Machine Intelligence (PAMI), 19(12):1338–1359, December 1997.

[48] W. Hl and C. Mu. Image semantic classification by using SVM. Journal of Software,

14(11):1891–1899, 2003.

[49] P. Hobson and Y. Kompatsiaris. Advances in semantic multimedia analysis for per-

sonalized content access. In Special Session on Advances in Semantic Multimedia

Analysis for Personalized Content Access, 2006 IEEE International Symposium on

Circuits and Systems (ISCAS 2006), Kos Island, Greece, 21 - 24 May, 2006.

[50] D. Hoiem, R. Sukthankar, H. Schneiderman, and L. Huston. Object-based image

retrieval using the statistical structure of images. In IEEE Conference on Computer

Vision and Pattern Recognition, June 2004.

[51] B. Hopkins. A new method for determining the type of distribution of plan-

individuals. Annals of Botany, 18:213–226, 1954.

[52] P. V. C. Hough. Method and Means for Recognizing Complex Patterns. U.S. Patent

3,069,654, Dec 1962.

[53] P. Howarth, A. Yavlinsky, D. Heesch, and S. M. Rüger. Medical image retrieval

using texture, locality and colour. In C. Peters, P Clough, J Gonzalo, G.J.F. Jones,

M. Kluck, and B. Magnini, editors, Multilingual Information Access for Text, Speech

and Images, 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004,

Bath, UK, pages 740–749, 2004.

[54] B. Huet and E. R. Hancock. Line pattern retrieval using relational histograms. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 21(12):1363–1370, 1999.

[55] IHP. International Standard for the Registration of Watermarks. International As-

sociation of Paper Historians (IHP), 1998.


[56] Q. Iqbal and J. K. Aggarwal. Using structure in content-based image retrieval. In

IASTED International Conference on Signal and Image Processing (SIP 99), pages

129–133, 1999.

[57] Q. Iqbal and J. K. Aggarwal. Combining structure, color and texture for image

retrieval: A performance evaluation. In Proceedings of the International Conference

on Pattern Recognition, volume 3, Quebec, Canada, pages 438–443, 2002.

[58] Q. Iqbal and J. K. Aggarwal. Retrieval by classification of images containing large

man-made objects using perceptual grouping. Pattern Recognition, 35:1463–1479,

July 2002.

[59] A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc.,

Upper Saddle River, NJ, USA, 1988.

[60] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing

Surveys, 31(3):264–323, 1999.

[61] A. K. Jain and A. Vailaya. Image retrieval using color and shape. Pattern Recognition,

29(8):1233–1244, August 1996.

[62] A. Jonk and A. Smeulders. An axiomatic approach to clustering line-segments. In

Proceedings of the Third International Conference on Document Analysis and Recog-

nition, volume 1, pages 386–389, Aug. 1995.

[63] L. Kaufmann and P. J. Rousseeuw. Finding Groups in Data. John Wiley & Sons,

Inc.,New York, 2005.

[64] A. Kimura, T. Kawanishi, and K. Kashino. Similarity-based partial image retrieval

guaranteeing same accuracy as exhaustive matching. In IEEE International Confer-

ence on Multimedia and Expo, (ICME), pages 1895–1898, 2004.

[65] R. A. Kirsch. Computer determination of the constituent structure of biological

images. Comp. Biomed. Res., 4(3):315–328, June 1971.

[66] K. Koffka. Principles of Gestalt Psychology. Harcourt, Brace and World, Inc., New

York, 1935.

[67] A. N. Kolmogorov. On the empirical determination of a distribution function (ital-

ian). Giornale dell’Instituto Italiano degli Attuari, 4:83–91, 1933.

[68] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. Statistical edge detection:

learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and

Machine Intelligence (PAMI), 25(1):57–74, January 2003.


[69] P. D. Kovesi. Edges are not just steps. In Proceedings of the Fifth Asian Conference

on Computer Vision, pages 822–827, January 2002. Melbourne.

[70] J. K. P. Kuan and P. H. Lewis. Complex textures classification with edge information.

In In Proceedings of Proceedings on Second International Conf. on Visual Information

System. San Diego, 1997.

[71] M. Kubat, R. Holte, and S. Matwin. Learning when negative examples abound. In

Proc. of the European Conference on Machine Learning, ECML97, Prague, pages

146–153, 1997.

[72] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathemat-

ical Statistics, 22:79–86, 1951.

[73] R. G. Lawson and P. J. Jurs. New index for clustering tendency and its application

to chemical problems. J. Chem. Inf. Comput. Sci., 30:36–41, 1990.

[74] L. Lee. Measures of distributional similarity. In 37th Annual Meeting of the Associ-

ation for Computational Linguistics, pages 25–32, 1999.

[75] T. K. Leen, T. G. Dietterich, and V. Tresp, editors. Advances in Neural Information

Processing Systems 13, Papers from Neural Information Processing Systems (NIPS)

2000, Denver, CO, USA. MIT Press, 2001.

[76] S. Lefevre, C. Dixon, C. Jeusse, and N. Vincent. A local approach for fast line

detection. In Digital Signal Processing, 2002. DSP 2002. 2002 14th International

Conference on, volume 2, pages 1109–1112, 2002.

[77] S. Lessmann. Solving imbalanced classification problems with support vector ma-

chines. In H. R. Arabnia, editor, IC-AI, pages 214–220. CSREA Press, 2004.

[78] M. S. Lew, N. Sebe, and J. P. Eakins, editors. Image and Video Retrieval, Interna-

tional Conference, CIVR 2002, London, UK, July 18-19, 2002, Proceedings, volume

2383 of Lecture Notes in Computer Science. Springer, 2002.

[79] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In SIGIR

’94: Proceedings of the 17th annual international ACM SIGIR conference on Research

and development in information retrieval, pages 3–12, New York, NY, USA, 1994.

Springer-Verlag New York, Inc.

[80] F. Li, J. Kosecka, and H. Wechsler. Strangeness based feature selection for part based

recognition. In Computer Vision and Pattern Recognition, 2006 Conference on, June

2006.


[81] H. Li, S.-C. Yan, and L.-Z. Peng. Robust non-frontal face alignment with edge based

texture. J. Comput. Sci. Technol., 20(6):849–854, 2005.

[82] J. Li and J. Wang. Automatic linguistic indexing of pictures by a statistical modeling

approach. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1075–1088, 2003.

[83] Y. Li and L. G. Shapiro. Consistent line cluster for building recognition in cbir.

In Proceedings of the International Conference on Pattern Recognition, volume 3,

Quebec, Canada, pages 952–956, 2002.

[84] S. Liapis and G. Tziritas. Image retrieval by colour and texture using chromaticity

histograms and wavelet frames. In Visual Information and Information Systems,

pages 397–406, 2000.

[85] Z. J. Liu, J. Wang, and W. P. Liu. Building extraction from high resolution - imagery

based on multi-scale object oriented classification and probabilistic hough transform.

In Proceedings of the IGARSS 2005 Symposium. Seoul, Korea. July 25-29, 2005.

[86] Z.-M. Lu, C Liu, and S. Sun. Digital image watermarking technique based on block

truncation coding with vector quantization. Chinese Journal of Electronics (English

Version), 11(2):152–157, 2002.

[87] Z.M. Lu, H. Skibbe, and H. Burkhardt. Image retrieval based on a multipurpose

watermarking scheme. In International Workshop on Intelligent Information Hiding

and Multimedia Signal Processing, Melbourne, Australia, September 2005.

[88] F. Del Marmol. Dictionnaire des filigranes classes en groupes alphabetique et

chronologiques. Namur: J. Godenne, 1900. -XIV, 1987.

[89] D. Marr and E. C. Hildreth. Theory of edge detection. Proceedings of Royal Society

of London, B-207:187–217, 1980.

[90] S. Meignier, J. Bonastre, and I. Magrin-Chagnolleau. Speaker utterances tying among

speaker segmented audio documents using hierarchical classification: towards speaker

indexing of audio databases. In Proceedings of International Conference on Spoken

Language Processing ICSLP., 2002.

[91] K. Mikolajczyk, A. Zisserman, and C. Schmid. Shape recognition with edge-based

features. In British Machine Vision Conference, volume 2, pages 779–788, September

2003.

[92] M. Mirmehdi and B. T Thomas, editors. Proceedings of the British Machine Vision

Conference 2000, BMVC 2000, Bristol, UK, 11-14 September 2000. British Machine

Vision Association, 2000.


[93] R. Mohan and R. Nevatia. Perceptual organization for scene segmentation and

description. IEEE Transactions on Pattern Analysis and Machine Intelligence,

14(6):616–635, 1992.

[94] N. Mukhopadhyay. Probability and Statistical Inference. New York: Dekker, 2000.

[95] H. Muller, W. Muller, D. Squire, S. Marchand-Maillet, and T. Pun. Performance

evaluation in content-based image retrieval: overview and proposals. Pattern Recogn.

Lett., 22(5):593–601, 2001.

[96] K. Murakami and T. Naruse. High speed line detection by hough transform in local

area. 15th International Conference on Pattern Recognition (ICPR’00), 03:3471,

2000.

[97] A. Natsev, R. Rastogi, and K. Shim. Walrus: A similarity retrieval algorithm for im-

age databases. IEEE Transactions on Knowledge and Data Engineering, 16(3):301–

316, 2004.

[98] H. Neemuchwala, A. O. Hero, and P. L. Carson. Image matching using alpha-entropy

measures and entropic graphs. European Journal of Signal Processing (Special Issue

on Content-based Visual Information Retrieval), Mar. 2004.

[99] A. Neri. Optimal detection and estimation of straight patterns. IEEE Trans. Image

Processing, 5(5):787–792, May 1996.

[100] T. Ojala, M. Pietikainen, and T. Maenpaa. Gray scale and rotation invariant texture

classification with local binary patterns. In D. Vernon, editor, ECCV (1), volume

1842 of Lecture Notes in Computer Science, pages 404–420. Springer, 2000.

[101] B. Ommer and J. Buhmann. Object categorization by compositional graphical mod-

els. In Rangarajan et al. [111], pages 235–250.

[102] M. Ortega, Y. Rui, K. Chakrabarti, S. Mehrotra, and T. S. Huang. Supporting

similarity queries in MARS. In ACM Multimedia, pages 403–413, 1997.

[103] D. Patel and T.J. Stonham. Texture image classification and segmentation using

rank-order clustering. In Proceedings of the 11th International Conference on Pattern

Recognition, The Hague, The Netherlands, pages 92–95, 1992.

[104] S. M. Peres and M. L. de Andrade-Netto. A fractal fuzzy approach to clustering

tendency analysis. In Lecture Notes in Computer Science, volume 3171, pages 395–

404, Jan. 2004.


[105] A. R. Pope and D. G. Lowe. Vista: A software environment for computer vision

research. In CVPR94, pages 768–772, 1994.

[106] A. P. D. Poz, G. M. D. Vale, and Zanin I. R. B. Automated road segment ex-

traction by grouping road objects. In XXth ISPRS Congress.’Geo-Imagery Bridging

Continents’, 12-23 July 2004, Istanbul, Turkey, 2004.

[107] J. M. S . Prewitt. Object enhancement and extraction. New York: Academic, 1970.

[108] J. Princen, J. Illingworth, and J. Kittler. A hierarchical approach to line extraction

based on the Hough transform. Comput Vision Graphics Image Process (CVGPDB),

52(1):57–77, October 1990.

[109] J. Puzicha. Distribution-based image similarity. In State-of-the-Art in Content-Based

Image and Video Retrieval [141], pages 143–164.

[110] G. Qian, S. Sural, Y. Gu, and S. Pramanik. Similarity between euclidean and cosine

angle distance for nearest neighbor queries. In SAC ’04: Proceedings of the 2004

ACM symposium on Applied computing, pages 1232–1237, New York, NY, USA,

2004. ACM Press.

[111] A. Rangarajan, B. C. Vemuri, and A. L. Yuille, editors. Energy Minimization Methods

in Computer Vision and Pattern Recognition, 5th International Workshop, EMM-

CVPR 2005, St. Augustine, FL, USA, November 9-11, 2005, Proceedings, volume

3757 of Lecture Notes in Computer Science. Springer, 2005.

[112] C. Rauber. Acquisition, archivage et recherche de documents accessibles par le con-

tenu: Application a la gestion d’une base de donnees d’images de filigranes. Ph.D.

Dissertation No. 2988, University of Geneva, Switzerland, March 1998.

[113] C. Rauber, T. Pun, and P. Tschudin. Retrieval of images from a library of water-

marks for ancient paper identification. In EVA 97, Elektronische Bildverarbeitung

und Kunst, Kultur, Historie, Berlin, Germany, Nov. 12 - 14, 1997., 1997.

[114] C. Rauber, J. Ruanaidh, and T. Pun. Secure distribution of watermarked images

for a digital library of ancient papers. In DL ’97: Proceedings of the second ACM

international conference on Digital libraries, pages 123–130, New York, NY, USA,

1997. ACM Press.

[115] K. J. Riley and J. P. Eakins. Content-based retrieval of historical watermark images:

I-tracings. In Lew et al. [78], pages 253–261.

[116] G. S. Robinson. Edge detection by compass gradient masks. Comput. Graphics Image

Processing, 6, 1977.


[117] A. Robles-Kelly and E. R. Hancock. Grouping line-segments using eigenclustering.

In Mirmehdi and Thomas [92].

[118] O. Ronneberger and F. Pigorsch. LIBSVMTL: A support vector machine

template library, 2004. Software available at

http://lmb.informatik.uni-freiburg.de/lmbsoft/libsvmtl/.

[119] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for

image retrieval. International Journal of Computer Vision, 40(2):99–121, Nov. 2000.

[120] Y. Rui, T. Huang, and S. Chang. Image retrieval: current techniques, promising

directions and open issues. Journal of Visual Communication and Image Represen-

tation, 10:39–62, April 1999.

[121] M. Johnston and S. Gallant. Image retrieval using image context vectors: First results.

In Storage and Retrieval for Image and Video Databases (SPIE), pages 82–94, 1995.

[122] S. Santini and R. Jain. Similarity measures. IEEE Transactions on Pattern Analysis

and Machine Intelligence, 21(9):871–883, 1999.

[123] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,

Regularization, Optimization and Beyond. MIT Press, 2002.

[124] H. Schulz-Mirbach. Invariant features for gray scale images. In G. Sagerer, S. Posch,

and F. Kummert, editors, 17. DAGM - Symposium “Mustererkennung”, pages 1–14,

Bielefeld, 1995. Springer.

[125] N. Sebe, M. Lew, and N. Huijsmans. Towards optimal ranking metrics. IEEE Trans.

on Pattern Analysis and Machine Intel. (PAMI), pages 1132–1143, Oct. 2000.

[126] M. C. Shin, D. B. Goldgof, K. W. Bowyer, and S. Nikiforou. Comparison of edge

detection algorithms using a structure from motion task. IEEE Trans. Systems, Man

and Cybernetics Part B, SMC-B, 31(4):589–601, August 2001.

[127] S. Siggelkow. Feature Historgrams for Content-Based Image Retrieval. PhD thesis,

Albert-Ludwigs-Universitat Freiburg, December 2002.

[128] S. Siggelkow and H. Burkhardt. Image retrieval based on local invariant features. In

Proceedings of the IASTED International Conference on Signal and Image Processing

(SIP) 1998, pages 369–373, Las Vegas, Nevada, USA, October 1998. IASTED.

[129] S. Siggelkow, M. Schael, and H. Burkhardt. SIMBA — Search IMages By Appear-

ance. Lecture Notes in Computer Science, 2191, 2001.


[130] A. Smeulders, M. Worring, S. Santiniand, A. Gupta, and R. Jain. Content-based

image retrieval at the end of the early years. IEEE Trans. Pattern Analysis and

Machine Intelligence, 22:1349–1380, Dec. 2000.

[131] J. R. Smith and S.-F. Chang. Visualseek: A fully automated content-based image

query system. In ACM Multimedia, pages 87–98, 1996.

[132] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish. Clustering Speakers by their

Voices. In IEEE International Conference On acoustics, speech, and signal processing

ICASSP, volume 2, pages 757–760, 1998.

[133] M. J. Swain and D. H. Ballard. Color indexing. Int. J. Comput. Vision, 7(1):11–32,

1991.

[134] T. Ojala, M. Pietikainen, and O. Silven. Edge-based texture measures for surface in-

spection. In Proc. 11th International Conference on Pattern Recognition, The Hague,

The Netherlands, volume 2, pages 594–598, 1992.

[135] J.-P. Tarel and S. Boughorbel. On the choice of similarity measures for image retrieval

by example. In Proceedings of ACM MultiMedia Conference, pages 446 – 455, Juan-

Les-Pins, France, 2002.

[136] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1998.

[137] K. Tsuda, M. Minoh, and K. Ikeda. Extracting straight-lines by sequential fuzzy

clustering. Pattern Recognition Letters, 17(6):643–649, May 1996.

[138] I. Valova and B. Rachev. Retrieval by color features in image databases. In ADBIS

(Local Proceedings), 2004.

[139] V. Vapnik. The Nature of Statistical Learning Theory. Springer, N.Y., 1995.

[140] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

[141] R. Veltkamp, H. Burkhardt, and H.-P. Kriegel. State-of-the-Art in Content-Based

Image and Video Retrieval. Kluwer Academic Publishers, 2001.

[142] R. Veltkamp and M. Tanase. Content-based image retrieval systems: A survey. In

Oge Marques and Borko Furht, editors, In O. Marques and B. Furht (Eds.), Content-

based image and video retrieval, pages 47–101. Kluwer Academic Publishers, 2002.

[143] J. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated

matching for picture LIbraries. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 23(9):947–963, 2001.


[144] S. Wang, F. Ge, and T. Liu. Evaluating edge detection through boundary detection.

EURASIP Journal on Applied Signal Processing, 2006:1–15, 2006.

[145] J. H. Ward. Hierarchical grouping to optimize an objective function. American

Statistical Association, 58:234–244, 1963.

[146] R. W. White and J. M. Jose. A study of topic similarity measures. In SIGIR ’04:

Proceedings of the 27th annual international ACM SIGIR conference on Research

and development in information retrieval, pages 520–521, New York, NY, USA, 2004.

ACM Press.

[147] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural

Networks, 16:645–678, May 2005.

[148] T.-J. Yen. A Qualitative Profile-based Approach to Edge Detection. PhD thesis,

Department of Computer Science New York University, 2003.

[149] Y. Yitzhaky and E. Peli. A method for objective edge detection evaluation and

detector parameter selection. IEEE Transactions on Pattern Analysis and Machine

Intelligence (PAMI), 25(8):1027–1033, August 2003.

[150] J. M. Zacks and B. Tversky. Event structure in perception and conception. Psycho-

logical Bulletin, 127(1):3–21, 2001.

[151] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of

web documents. In In Proceedings of the 3rd International Conference on Knowledge

Discovery and Data Mining, pages 287–290, 1997.

[152] X. Zhou and T. S. Huang. Edge-based structural features for content-based image

retrieval. Pattern Recognition Letters, 22(5):457–468, 2001.

[153] D. Ziou and S. Tabbone. Edge detection techniques - an overview. International

Journal of Pattern Recognition and Image Analysis, 8:537–559, 1998.