Context Dependent Segmentation and Matching in Image Databases

Hayit Greenspan, Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Guy Dvir, Faculty of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
Yossi Rubner, Applied Materials, Israel
July 31, 2003
Abstract
The content of an image can be summarized by a set of homogeneous regions in an appropriate feature space. When exact shape is not important, the regions can be represented by simple “blobs”. Even for similar images, the blob representations of the two images might vary in shape, position, the number of blobs, and the represented features. In addition, separate blobs in one image might correspond to a single blob in the other image and vice versa. In this paper we present the BlobEMD framework as a novel method to compute the dissimilarity of two sets of blobs while allowing for context-based adaptation of the image representation. This results in representations that represent the original images well while at the same time being best aligned with respect to the representations of the context images.
We compute the blobs by using Gaussian mixture modeling and use the Earth Mover’s Distance (EMD) to compute both the dissimilarity of the images and the flow matrix of the blobs between the images. The BlobEMD flow-matrix is used to find optimal correspondences between source and target image representations and to adapt the representation of the source image to that of the target image. This allows for similarity measures between images that are insensitive to the segmentation process and to different levels of detail of the representation. We show applications of this method for content-based image retrieval, image segmentation, and matching models of heavily dithered images with models of full-resolution images.
1 Introduction
Many content-based retrieval works rely on an initial segmentation of the input and archived
images. Yet, image segmentation remains one of the more challenging problems in computer vision and often is not well defined, as different contexts entail different segmentations of the same image. For example, in some contexts it is more appropriate to segment together all the trees in an image of a forest, while in other contexts each tree should stand on its own. In this work we address the challenge of comparing similar images that are segmented differently and/or are represented at varying levels of resolution, as is the case in dithered images.
The “BlobEMD” framework is proposed in this work as a simultaneous solution to both
the image representation problem and the estimation of the distance between images. This
coupling allows for context-based model adaptation where the representation of one image
is adjusted based on the representation of a second image - the context. The framework
combines an initial transition from image pixels to representative image regions (segments
or blobs) via Gaussian mixture modelling (GMM) [2], followed by utilizing the Earth mover’s
distance measure (EMD) [19] for finding the optimal correspondences between regions in the
two images, and extracting an overall image matching measure between two input images.
The correspondences between the regions in the two images are used to merge and split the regions, so that they still represent the images well but at the same time bring the two representations to a common context. For example, the problem of image segmentation is treated here
as an image pair (source-target) task. Thus, an image will be segmented differently based
on the target image. The suggested framework provides for image representations that are
more uniform and best aligned between the two images to be matched.
The overall framework of the image representation and matching phases is represented in
Figure 1. In section 2 we review some of the related work and motivate the proposed scheme.
The BlobEMD framework is presented in section 3. In addition to the distance between two
sets of blobs, the BlobEMD results in a flow matrix with correspondences between blobs.
In section 4 we focus on the flow-matrix and provide a set of rules for extracting region-
correspondences between images and for image model adaptation. Experimental evaluation
of the BlobEMD framework, along with its application to context-based image segmentation
Figure 1: A block diagram of the BlobEMD matching system
and robust image matching are presented in section 5.
2 Related Work
Histograms are the classical means of representing image content and are widely used as
the chosen image representation [8, 1]. A histogram is a discrete representation of the
continuous feature space, generated by a partitioning of the feature space. The partitioning is
determined by the feature space chosen (e.g. the color space representation), the quantization
scheme chosen (such as uniform or vector quantization), as well as computational and storage
considerations. The advantages and disadvantages of color histograms are well studied [23] and
many variations exist [16, 22, 13].
Several measures have been proposed for the dissimilarity between two histograms. In general they can be divided into two categories [20, 17]: “bin-by-bin” measures, which compare the contents of corresponding histogram bins, and “cross-bin” measures, which enable comparisons across non-corresponding bins. The first category includes the Minkowski-form distance, the histogram intersection (H.I.) measure [23, 20], the χ2 statistic, the Kullback-Leibler (KL) divergence [14, 4], and others. “Cross-bin” measures also combine the feature
space information of the bins (e.g. the dissimilarities between colors represented by the his-
togram bins). Such measures include the Quadratic-form distance [11] in which a similarity
matrix is included to represent similarity between bins. The Earth mover’s distance measure
[19] extracts dominant modes from a histogram, as a signature, and defines a measure of sim-
ilarity between signatures. Additional distance measures between histogram representations
in an image matching task are evaluated and compared in [19, 17, 20].
The histogram representation has been extended recently to include additional features
as well as spatial information. In [16] each entry of a “joint” histogram contains the number
of pixels in the image that are described by a particular combination of feature values. In
[22] local information is included by dividing an image into five fixed overlapping blocks and
extracting the first three color moments of each block to form a feature vector for the image.
In [13] correlograms are proposed to take into account the local color spatial correlation as
well as the global distribution of the spatial correlation.
Other works in image representation include “region-based” approaches. Image regions
are the basic building blocks in forming the visual content of an image, and thus have great
potential in representing the image content and enabling image matching. In [21] Smith
and Chang store the location of each color that is present in a sufficient amount in regions
computed using histogram backprojection. Ma and Manjunath [15] perform retrieval based
on segmented image regions. The segmentation is not fully automatic as it requires some
parametric tuning and hand pruning of regions. Unsupervised segmentation of an image
into homogeneous regions in feature space, such as the color and texture space, can be found
in the “Blobworld” image representation [2, 3]. In [2] a naive Bayes algorithm is used to
learn image categories from the blob representation in a supervised learning scheme. The
framework suggested entails learning blob-rules per category. Thus, one may argue that
there is a shift to a high-level image description (image labeling). Each query image is next
compared with the extracted category models, and associated with the closest matching
category. In [3] the user composes a query by viewing the Blobworld representation and
selecting the blobs to match along with possible weighting of the blob features. A query may
include a combination (conjunction) of two blobs. In essence, the image matching problem
is shifted to a (one or two) blob matching problem. Each blob is compared with all blobs in
each database image. Spatial information is thus included, yet in a very concise manner. It
should be noted that each blob is represented by a color histogram, thus the representation
is a discrete representation (in the image plane as well as in feature space).
An extension to the Blobworld system, termed “GMM-KL” framework, has recently
been proposed [10]. The set of regions in an image is represented by a continuous Gaussian
mixture model (GMM). Images are next compared and matched via the continuous and
probabilistic KL distance between distributions. The GMM-KL framework achieves strong
matching results between images while addressing the problem of ‘multiple-blob’ to ‘multiple-
blob’ matching. In the current work we similarly extend the Blobworld system to address
the ‘multiple-blob’ matching problem. The continuous GMM representation is used in the
image representation stage following which we utilize the EMD distance measure in the
matching stage. In addition to providing a distance measure between multiple blob sets,
the BlobEMD framework generates a flow-matrix which provides correspondences between
individual source and target blobs. Thus the BlobEMD flow-matrix addresses the region-
correspondence problem between the two images. This information is used for context-based
image model adaptation, as will be exemplified in the following sections.
3 The BlobEMD Framework
In order to measure similarities between images that are represented by homogeneous regions,
we need to define an appropriate dissimilarity measure. This problem is harder when the two
sets of regions do not have clear correspondences; often, a region in one image matches the union of several regions, or parts of regions, in the second image. An example of this can be seen in Figure 8(a). Both images show a lake and two trees. However, in the left image
the lake is represented by a single region while in the right image it is represented by three
regions. Similarly, the tree-tops in the right image are combined into a single region. In
order for the dissimilarity measure to perform properly, it should solve these cases. This is
done by the BlobEMD framework.
The BlobEMD framework [9] consists of three main steps (see Figure 1): First, each
input image is modeled as a Gaussian mixture distribution in a selected feature space. The
EMD is next utilized for measuring similarity between the respective models of two images.
In addition to the similarity measure between sets of regions, the EMD also returns the
correspondence (flow) between them. The third step uses these correspondences to adapt
one (source) image model based on the model of the second (target) image. Adaptation of the
image models achieves context based modeling and segmentation, and provides better overall
image similarity measures. The three steps are described in more detail in the following
sections.
3.1 Image representation via Gaussian mixture modeling
In the representation phase, each homogeneous region in the image is represented by a
Gaussian distribution and the set of regions in the image is represented by a Gaussian mixture
model (GMM). Pixels are grouped into homogeneous regions in the image plane by grouping
feature vectors in a selected feature space. We use the five-dimensional feature space of color
and space (L, a, b, x, y), where (L, a, b) is the 3-dimensional CIE-Lab color space [24], and
(x, y) is the spatial image plane. We use the CIE-Lab color space as it was designed so that
(short) Euclidean distances between two colors match perceptual similarity. The underlying
assumption is that the image colors and their spatial distribution in the image plane are
generated by a mixture of Gaussians. It should be noted that the representation model is
general, and can incorporate any desired feature space (such as color, texture, shape, etc) or
combination thereof.
The distribution of a random variable X ∈ R^d is a mixture of k Gaussians if its density function is

    f(x|θ) = Σ_{j=1}^{k} α_j · (1/√((2π)^d |Σ_j|)) · exp{ −(1/2)(x − μ_j)^T Σ_j^{−1} (x − μ_j) },    (1)

such that the parameter set θ = {α_j, μ_j, Σ_j}_{j=1}^{k} consists of: α_j > 0, Σ_{j=1}^{k} α_j = 1, μ_j ∈ R^d, and Σ_j a d×d positive definite matrix.
Given a set of feature vectors x_1, . . . , x_n, the maximum likelihood estimate of θ is

    θ_ML = arg max_θ f(x_1, . . . , x_n | θ).    (2)

Since a closed-form solution for this maximization problem is not possible, we utilize the Expectation-Maximization (EM) algorithm [5] as an iterative method to obtain θ_ML (similar to [3]).
The iterative EM algorithm is initialized via the K-means algorithm [7], and is repeated
until the log-likelihood measure is increased by less than a predefined threshold (1%) from
one iteration to the next. The MDL principle [4] is used to select the number of mixture
components (or number of means), k, as best suits the natural number of groups present in
the image.
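As a concrete illustration of equation (1), the mixture density can be evaluated directly. The sketch below is ours, not code from the paper; the function name and the array layout are illustrative choices.

```python
import numpy as np

def gmm_density(x, alphas, mus, sigmas):
    """Evaluate the Gaussian mixture density of Eq. (1) at a point x.

    alphas: (k,) mixture weights, summing to 1
    mus:    (k, d) component means
    sigmas: (k, d, d) positive definite covariance matrices
    """
    d = mus.shape[1]
    total = 0.0
    for alpha, mu, sigma in zip(alphas, mus, sigmas):
        diff = x - mu
        # Normalizer 1 / sqrt((2*pi)^d * |Sigma_j|)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
        # Quadratic form -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
        expo = -0.5 * diff @ np.linalg.solve(sigma, diff)
        total += alpha * np.exp(expo) / norm
    return total
```

In the paper's setting, x would be a five-dimensional (L, a, b, x, y) feature vector and the parameters would come from the EM fit described above.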
Once we associate a Gaussian mixture model to an image, the image can be viewed as a
set of independently identically distributed samples from the Gaussian mixture distribution.
Examples of images with their respective models are shown in Figures 8-11. Each localized Gaussian mixture is shown as a set of ellipsoids, with each ellipsoid representing the support, mean color, and spatial layout of a particular Gaussian in the image plane. The variability in the number of regions, their layouts, and their colors for similar-context input images is evident in the GMM representation as well as in the image plane.
3.2 The Earth Mover’s Distance (EMD)
In [19] the concept of the Earth Mover’s Distance is introduced as a flexible similarity measure
between multidimensional distributions, and is described in detail therein. Intuitively, given
two distributions represented by sets of weighted features, one can be seen as a mass of
“earth” properly spread in the feature space, the other as a collection of “holes” in that
same space. The EMD measures the least amount of work needed to fill the holes with earth.
Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance
which is a distance in the feature space. The EMD is based on the transportation problem
[12] and can be solved efficiently by linear optimization algorithms that take advantage of
its special structure.
Formally, let S = {(s_1, w_{s1}), . . . , (s_m, w_{sm})} be the first set with m regions, where s_i is the region descriptor and w_{si} is the weight of the region; T = {(t_1, w_{t1}), . . . , (t_n, w_{tn})} the second set with n regions; and DIST = [dist(s_i, t_j)] the ground distance matrix, where dist(s_i, t_j) is the distance between regions s_i and t_j. The EMD between sets S and T is then

    EMD(S, T) = ( Σ_{i=1}^{m} Σ_{j=1}^{n} f_ij · dist(s_i, t_j) ) / ( Σ_{i=1}^{m} Σ_{j=1}^{n} f_ij ),    (3)
where F = [fij], with fij ≥ 0 the flow between si and tj, is the optimal admissible flow from
S to T that minimizes the numerator of (3) subject to the following constraints:
    Σ_{j=1}^{n} f_ij ≤ w_{si},      Σ_{i=1}^{m} f_ij ≤ w_{tj},      Σ_{i=1}^{m} Σ_{j=1}^{n} f_ij = min( Σ_{i=1}^{m} w_{si}, Σ_{j=1}^{n} w_{tj} ).
Notice that the two sets can have different total weights, which allows for partial matches [19]. The EMD yields both a distance measure and the actual flow; both are used in our framework.
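The transportation problem underlying the EMD can be sketched as a small linear program. The following is an illustrative implementation, not the efficient special-purpose solver of [19]; it assumes scipy is available, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linprog

def blob_emd(ws, wt, dist):
    """Solve the transportation LP of Eq. (3) with its three constraints.

    ws:   (m,) source weights;  wt: (n,) target weights
    dist: (m, n) ground distance matrix
    Returns the EMD value and the (m, n) flow matrix F = [f_ij].
    """
    m, n = dist.shape
    # Inequality constraints: row sums <= ws_i, column sums <= wt_j.
    row_sums = np.kron(np.eye(m), np.ones(n))   # (m, m*n)
    col_sums = np.kron(np.ones(m), np.eye(n))   # (n, m*n)
    A_ub = np.vstack([row_sums, col_sums])
    b_ub = np.concatenate([ws, wt])
    # Equality constraint: total flow equals the smaller total weight.
    A_eq = np.ones((1, m * n))
    b_eq = [min(ws.sum(), wt.sum())]
    res = linprog(dist.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    flow = res.x.reshape(m, n)
    return (flow * dist).sum() / flow.sum(), flow
```

For the blob sets of this paper, `dist` would hold the pairwise Frechet ground distances and `ws`, `wt` the relative pixel counts of the Gaussians.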
3.3 Combining the EMD distance with GMM representation
The EMD distance is combined with the GMM image representation in the BlobEMD frame-
work. The source and target sets (S and T ) are the blob sets (GMMs) per source and target
image and the EMD is used to find correspondences between the blobs, or regions. These cor-
respondences are optimal in the sense that they minimize the overall EMD distance (equation
3) between the images.
Figure 2 shows the bi-partite graph with which the EMD problem is defined and solved.
The source and target images yield two sets of blobs {s_1, . . . , s_m} and {t_1, . . . , t_n}. The source blobs
comprise the vertices of the left-hand side of the bi-partite graph. The target blobs comprise
the right-hand vertices of the graph. Note that each of the two images can be represented
by a different number of blobs. Each connecting arc is weighted by the ground-distance
between the corresponding source and target blob pair. This ground distance, dist(s, t),
can be defined in several ways. Here we use the Frechet distance [6] which is a closed-form
solution to the EMD in the case of two equal weight Gaussians and therefore is a natural
distance for the Gaussian blob representation (see Appendix A). In the EMD algorithm,
each vertex has a description and a weight. In our case the vertex description corresponds
to the feature vector (blob description) and the weight of a vertex is defined by the relative
weight of the corresponding Gaussian, in other words, the relative number of pixels that
correspond to the Gaussian (blob). The source and target weights determine how much flow
can be transferred from the source blob and to the target blob, respectively.
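Since Appendix A is not reproduced here, we note that the Frechet distance between two Gaussians is commonly given in closed form as d² = ||μ1 − μ2||² + tr(Σ1 + Σ2 − 2(Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2}). A minimal sketch of this ground distance, under that reading:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_gaussians(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    s1_half = np.real(sqrtm(sigma1))
    # Cross term (Sigma1^{1/2} Sigma2 Sigma1^{1/2})^{1/2}
    cross = np.real(sqrtm(s1_half @ sigma2 @ s1_half))
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * cross)
    return float(np.sqrt(max(d2, 0.0)))  # guard tiny negative round-off
```

With equal covariances the expression reduces to the Euclidean distance between the means, which matches the intuition of moving one Gaussian blob onto the other.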
The EMD provides an optimal solution to the minimization problem defined on the bi-
partite graph, with the constraint that the maximum possible flow is transferred from the
source to the target image.

Figure 2: Feature vectors (blob) correspondence using a fully-connected bi-partite graph

The generated solution yields the best match between source and
target blobs of the corresponding source and target images, along with an overall minimal
distance between the images, as defined by equation (3). Solving the minimization problem
results in a generated flow matrix. The flow matrix represents the amount of flow on each
arc of the fully-connected bi-partite graph. Examples of flow-matrices can be seen in Figures
8-11. The flow value lies in the interval [0, 1], where 0 indicates that no flow exists through an arc and 1 indicates that the entire weight of the source image is transferred through the arc (this situation can occur in the trivial case in which the source and the target images each consist of a single region). The flow matrix shows the flow from each blob in the source image (rows) to the blobs of the target image (columns).
3.4 Image model adaptation
Adaptation of an image model is useful when images are represented in inconsistent ways: for example, under- and over-segmented images in the spatial domain, or dithered images in the color domain. The resulting flow-matrix is used for context-dependent image model adaptation.
Model adaptation can be applied in one of two possible adaptation modes: (1) Adapt
the representation model of a source image with respect to a second, target image, while
still maintaining similarity to the original model. Here only the source image representation is modified while the target image is unaffected. We hereon refer to this mode as
“source-to-target adaptation”; (2) Adapt both image models to reach the best common mu-
tual representation, keeping their similarities with the respective images. This mode will be
referred to as “mutual adaptation”.
The model adaptation is performed by an iterative process on the GMM models of the
two images by applying a series of merging and splitting steps on the source image GMM,
or on both the source and target image GMMs, depending on the mode used. The rules for
blob merging and blob splitting are based on the BlobEMD flow-matrix and are defined in
detail in the following section. In general, two blobs from one image will be considered for
merging if they flow (almost) entirely to a single blob in the other image. A blob will be
considered for splitting if it flows to several blobs in the other image, and these blobs also
receive flow from other blobs in the first image. Without the second condition the merging
rule would be applicable in the opposite direction - merge the blobs in the other image to
match the blob in the first image. Merging is always preferable over splitting to simplify the
resulting models.
4 Model Adaptation Rules
Candidate blobs for merging and splitting are chosen based on the flow matrix that results from the BlobEMD computation. Candidate blobs for a merge are characterized by rows (or columns) with a single large value in the same column (or row) of the flow matrix. A candidate blob for splitting is characterized by a row (column) with multiple values such that for each value, its respective column (row) contains additional non-zero entries.
For blobs in the candidate list to qualify for merging or splitting, three additional condi-
tions need to be met:
1. Similarity in feature space. The BlobEMD finds correspondences between all blobs in
the source and target images in a way that minimizes the global distance between the
two sets of blobs. However, since the EMD process is forced to match all blobs, it often
needs to compromise and match blobs, or parts of blobs, that are rather dissimilar from
each other. We require the respective candidate blobs in the two images to exhibit good
similarity in the feature space. For that we use the same ground distance GDF (·, ·),
that was used for the BlobEMD computation. In this work we usually use the Frechet
distance in L, a, b color space (see Appendix A). In the case of dithered images the
Frechet distance is used in x, y space (as will be shown in section 5.4).
2. Significant spatial overlapping. Even when respective candidate blobs are similar in the
feature space, they might not be spatially close enough. Merging and splitting require
significant spatial overlap of the blobs. For this purpose we define a second ground
distance, GDS(·, ·), which ignores the similarity in the feature space and measures only
the spatial overlap. We require that this measure returns zero when spatially, one blob
completely contains the other (i.e., a small blob inside a large blob). Given two blobs s and t, consider the corresponding sets of pixels {p_i}_{p_i ∈ s_2σ} and {p_j}_{p_j ∈ t_2σ}, where s_2σ and t_2σ are the 2σ projections of the Gaussian blobs on the (x, y) plane (i.e., all the pixels in the Gaussian blobs within a Mahalanobis distance of 2σ). We define this distance as

    GD_S(s, t) = 1 − |{p_i} ∩ {p_j}| / min(|{p_i}|, |{p_j}|),    (4)

where | · | denotes the size of the set.
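Equation (4) can be sketched directly; the snippet below is ours and assumes the 2σ pixel supports of the two blobs are given as collections of (x, y) coordinates.

```python
def spatial_ground_distance(pixels_s, pixels_t):
    """GD_S of Eq. (4): 1 - |intersection| / min(|s|, |t|), computed over
    the 2-sigma pixel supports of two blobs, given as (x, y) tuples."""
    s, t = set(pixels_s), set(pixels_t)
    return 1.0 - len(s & t) / min(len(s), len(t))
```

Note the required property: when one blob's support is entirely contained in the other's, the intersection equals the smaller set and the distance is exactly zero.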
3. Significant flow. For a merge, we require that nearly all the weights of the candidate
blobs flow to the corresponding target blob. To split a candidate blob, we require that
the resulting blobs are not too small, i.e. the candidate blob has a significant flow to
the corresponding target blobs.
The conditions for the merge and split are summarized in Table 1. In the diagrams, the weight of blob s_i is denoted by w_si, and the flow between source blob s_i and target blob t_j by f(s_i, t_j). The conditions involve several empirical thresholds that are application and domain dependent (see examples in Section 5). Notice that for the spatial similarity condition the threshold for the merge, C_S1, is different from the threshold for the split, C_S2. For the merge we demand that the target blob overlap the two source blobs, while for the split we require only partial overlap. In general, C_S2 < C_S1. This reasoning also applies to the thresholds of the significant flow condition. For the merge we want C_flow1 to be close to 1, meaning that nearly all the weight of the source blobs flows to the target blob. For the split we require each of the target blobs to carry a significant amount of the source blob; therefore, C_flow2 < C_flow1 < 1.
The conditions are summarized below (the schematic before/after blob diagrams of the original table could not be recovered and are omitted):

                           Merge (s_i, s_j → t_k)              Split (s_i → t_k, t_l)

Feature space similarity:  GD_F(s_i, t_k) < C_F                GD_F(s_i, t_k) < C_F
                           GD_F(s_j, t_k) < C_F                GD_F(s_i, t_l) < C_F

Spatial similarity:        GD_S(s_i, t_k) < C_S1               GD_S(s_i, t_k) < C_S2
                           GD_S(s_j, t_k) < C_S1               GD_S(s_i, t_l) < C_S2

Significant flow:          f(s_i, t_k)/w_si > C_flow1          f(s_i, t_k)/w_si > C_flow2
                           f(s_j, t_k)/w_sj > C_flow1          f(s_i, t_l)/w_si > C_flow2

Table 1: Merge and split conditions.
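The checks of Table 1 can be sketched as predicates over precomputed ground distances and flows. The threshold values below are illustrative placeholders of ours, not values from the paper, which states only that the thresholds are empirical and domain dependent.

```python
def merge_candidate(gdf_ik, gdf_jk, gds_ik, gds_jk, f_ik, f_jk, w_si, w_sj,
                    c_f=0.1, c_s1=0.5, c_flow1=0.9):
    """Table 1 merge conditions for source blobs s_i, s_j flowing to t_k."""
    return (gdf_ik < c_f and gdf_jk < c_f          # feature space similarity
            and gds_ik < c_s1 and gds_jk < c_s1    # spatial overlap
            and f_ik / w_si > c_flow1              # nearly all weight flows
            and f_jk / w_sj > c_flow1)

def split_candidate(gdf_ik, gdf_il, gds_ik, gds_il, f_ik, f_il, w_si,
                    c_f=0.1, c_s2=0.8, c_flow2=0.2):
    """Table 1 split conditions for source blob s_i flowing to t_k, t_l."""
    return (gdf_ik < c_f and gdf_il < c_f          # feature space similarity
            and gds_ik < c_s2 and gds_il < c_s2    # partial spatial overlap
            and f_ik / w_si > c_flow2              # each part is significant
            and f_il / w_si > c_flow2)
```

Consistent with the discussion above, the defaults satisfy C_S2 > C_S1 only in the loose sense that splits tolerate weaker overlap, and C_flow2 < C_flow1 < 1.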
The model adaptation process consists of several consecutive merging and splitting steps
conducted on the source and target images. Next we describe in detail the merging and
splitting steps. A description of the entire process will follow.
4.1 Blob merging
In the merging process the mixture model is updated, resulting in a smaller set of blobs and
updated feature characteristics. The process is an iterative one, passing through all merging
candidate lists, and finalizing when no additional merging is possible.
Figure 3: Synthetic example of the source-to-target merging process. (a) A cross image is the source image (left) that is matched to the target, line image (right); (b) initial image models (representation layer); (c) final image models following source model adaptation. Notice that the two blobs in the source image that match the line in the target image were merged together.
The merging process replaces pairs of blobs from the source image with a single new blob. The new blob's spatial position and statistics are based on the original source blobs. Given two blobs b_i = (w_i, μ_i, Σ_i) and b_j = (w_j, μ_j, Σ_j), the merged blob parameters b = (w, μ, Σ) are calculated as follows:

    w = w_i + w_j    (5)

    μ = (w_i / w) μ_i + (w_j / w) μ_j    (6)

    Σ = (w_i / w)(Σ_i + μ_i μ_i^T) + (w_j / w)(Σ_j + μ_j μ_j^T) − μ μ^T    (7)
The derivations of these equations can be found in Appendix B.
Figure 3 shows an example of the context-based merging process. An image of a cross is the source image (left) that is matched to the target, an image of a line (right). The initial source image models are shown in the center row, and the resulting image models, following source model adaptation, are shown in the bottom row. Perceptually, the image models look more similar following the merging process.
4.2 Blob splitting
Splitting occurs, for example, in images with a large uniform background that is represented
by a single large blob, or when the segmentation process results in a small number of segments
(under-segmentation). Often, splitting blobs enables the blob parts to be merged with other
blobs in a follow-up merging process.
Hereon we term the set of target blobs to which the source blob flows the “target-blobs” set. Once the target-blobs set is defined per source blob, we wish to split the source blob into a set of smaller blobs, each corresponding to one of the target blobs in the set. The splitting process is done as follows:
1. Randomly sample the source blob according to its Gaussian distribution.
2. Each sample x is probabilistically affiliated with each target-blob distribution g_j(x|θ_j), j = 1, . . . , N.
3. For each target blob j, the set of M samples from the source blob of highest affiliation
to blob j is collected.
4. A Gaussian is learned for each set of M samples.
5. The source image mixture model is updated accordingly.
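Steps 1-5 can be sketched as follows. The sample count, the per-target share of samples kept (M as a fraction), and the even re-weighting of the resulting blobs are simplifying assumptions of ours; the paper does not fix these choices.

```python
import numpy as np

def split_blob(source, targets, n_samples=2000, frac=0.3, seed=0):
    """Split a source Gaussian blob toward a set of target blobs (steps 1-5).
    Each blob is a (weight, mean, covariance) triple."""
    rng = np.random.default_rng(seed)
    w_s, mu_s, sigma_s = source
    samples = rng.multivariate_normal(mu_s, sigma_s, size=n_samples)  # step 1

    def log_gauss(x, mu, sigma):
        # Unnormalized-in-constant log density, enough for ranking affiliation
        diff = x - mu
        inv = np.linalg.inv(sigma)
        return (-0.5 * np.einsum('nd,dk,nk->n', diff, inv, diff)
                - 0.5 * np.log(np.linalg.det(sigma)))

    m = int(frac * n_samples)
    new_blobs = []
    for w_t, mu_t, sigma_t in targets:
        affinity = log_gauss(samples, mu_t, sigma_t)   # step 2: affiliation
        best = samples[np.argsort(affinity)[-m:]]      # step 3: top-M samples
        mu = best.mean(axis=0)                         # step 4: fit a Gaussian
        sigma = np.atleast_2d(np.cov(best, rowvar=False))
        # step 5: even weight split is a simplification of ours
        new_blobs.append((w_s / len(targets), mu, sigma))
    return new_blobs
```

For a wide source blob flanked by two narrow targets, the two resulting blobs separate toward the respective target means, which is what the follow-up merging step relies on.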
Figure 4 shows an example of source-to-target context-based splitting on synthetic images. The representation of the input image (top left) is updated according to the given target image (top right). The input representation layer is shown in the center row. The resulting output models, following one step of source model adaptation, are shown in the bottom row. Note that if mutual adaptation were pursued in this case, a merging of the target model would have preceded the splitting of the source model.
4.3 The complete adaptation process
Figure 5 shows a flow-chart of the complete adaptation process. The process is an iterative
merging and splitting process. The adaptation process modifies either the source model or
both the source and the target models according to the adaptation mode (source-to-target
or mutual adaptation). The update loop terminates once no change is found in the source
model (source-to-target adaptation) or in both the source and the target models (mutual
adaptation mode). An optional post-processing step follows the main update loop. The post-
processing includes an additional source-target merging step followed by an intra-merging
step.
Intra-merging is an additional blob-merging step that is pursued in the mutual adaptation mode, for each of the source and target models. It acts as an image smoothing filter. The blob set of each image is checked for pairs of highly similar blobs. Two blobs b_i and b_j within an image may be merged if they are close in feature space and exhibit spatial similarity. We use the following criteria: GD_F(b_i, b_j) < 0.05, GD_S(b_i, b_j) < 1.0. The intra-merging step
was found to be helpful in cases that result in many small blobs, i.e. the optimal match still
entailed a very large set of blobs (such a case may occur if we start with a large set of blobs
in each image). The outcome of the adaptation process is a set of newly segmented source
and target models with a final updated distance measure between them.
5 Experimental Results of the BlobEMD Framework
We have described the BlobEMD framework which consists of three main steps: First, each
input image is modeled as a Gaussian mixture distribution in the joint (L, a, b, x, y) feature
space. The EMD is next utilized for measuring similarity between the respective models of
two images. In addition to the similarity measure between sets of regions, the EMD also
returns the correspondence (flow) between them. The third step uses these correspondences
to adapt the source and target models according to the adaptation mode chosen.
In this section we present an investigative analysis of the BlobEMD framework. We start
with the combination of the first two steps: the GMM representation and the EMD distance
without the merging and splitting steps. We investigate the framework’s robustness in the
image matching task and its application to the image retrieval task. We next illustrate
the utilization of the flow-matrix for model adaptation within several application domains.
These include context-based image segmentation and dithered image matching.
5.1 Robustness to fragmentation in the image representation
Images with semantically similar content may be represented by differing numbers of regions via the Gaussian mixture model (parameter k). The goal is to compare and match images regardless of this variability, demonstrating robustness to it. In [10] we introduced a
novel intra-inter class statistical evaluation methodology as a benchmarking measure. The
intra-class set of images corresponds to image samples of similar content, and the inter-class set corresponds to pairings of images with different content. We use the inter-intra evaluation
scheme to evaluate the robustness of the BlobEMD framework to fragmentation in the image
representation.
In this experiment we use a random set of 245 images extracted from the COREL
database. The ground truth is generated by choosing four mixture representations (four values of k: k = 3, 4, 5, 6) per input image. The “intra-class” distance set is computed as the
distances between all combinations of representation models per image. Note that the simi-
larity of the models within the “intra-class” set is an objective one and does not depend on
subjective labeling. We have overall a set of 12 non-zero distances per image. This process
is repeated for each of the 245 images in the database for an overall 12 × 245 distances. A
second set of distances is computed across images, with each image represented by the MDL
chosen mixture representation (the optimal k value). We term this set of distances (with
245× 244 distances) the “inter-class” distance set.
A histogram of the "intra-class" and "inter-class" distances is plotted in each of the two
graphs presented in Figure 6. The graph on the left shows results in the color-only feature
space, while the graph on the right shows the distances between images when compared in a
combined color and spatial (x, y) feature space. Two distinct modes are present in both graphs,
demonstrating the clear separation between the sets. The "intra-class" distances are narrowly
spread at the lower end of the axis (close to zero), as compared to the widely spread and larger
distance values of the "inter-class" set. These results indicate the strong similarity between
same-class models (the same image with different values of k), regardless of the variability in
the representation. The BlobEMD framework is thus robust to fragmentation in the
representation space.
5.2 Statistical performance evaluation
We next demonstrate the applicability of the presented framework to the image retrieval
task. In addition to the random set of 245 images, an additional set of 70 images was
hand-picked, comprising 6 different classes or categories (10 images per class). Labeled
categories include: "car", "desert", "field", "monkey", "snow" and "waterfall". Each image
in the database is processed to extract the localized Gaussian mixture representation. The
BlobEMD with the Fréchet ground distance is then computed between each of the images
and an input query image. The images are sorted by distance, and the closest ones
are presented as the retrieval results.
Retrieval results are evaluated by precision versus recall (PR) curves. Recall measures
the ability to retrieve all relevant or similar information items in the database. It is defined
as the ratio between the number of relevant or perceptually similar items retrieved and the
total number of relevant items in the database (in our case, 10 relevant images for each of the
labeled classes). Precision measures retrieval accuracy and is defined as the ratio between the
number of relevant or perceptually similar items retrieved and the total number of items retrieved.
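These two definitions translate directly into code. The sketch below (the names are ours) computes precision and recall after each retrieved item, from which a PR curve can be plotted:

```python
def precision_recall(ranked_labels, query_label, n_relevant):
    """Precision and recall after each retrieved item.

    ranked_labels: class labels of the database images, sorted by
        increasing distance to the query.
    n_relevant: total number of relevant images in the database
        (10 per labeled class in the experiment above).
    """
    precision, recall, hits = [], [], 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
        precision.append(hits / rank)   # relevant retrieved / retrieved
        recall.append(hits / n_relevant)  # relevant retrieved / all relevant
    return precision, recall
```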
Precision vs. recall (PR) curves are extracted for each of the 6 categories. A comparison
is conducted with a global histogram representation under several histogram distance measures,
as well as with our earlier work on the GMM-KL framework [10]. In the GMM-KL framework
the continuous KL distance is used to measure the distance between two continuous
distributions, the two GMMs representing the two input images. The definition of the continuous
KL distance is given in Appendix A. Histogram measures include the bin-to-bin Euclidean
distance (Euc.), the histogram intersection measure (H. I.) and the discrete KL measure
(Disc. KL) [23, 20, 17]. A binning of 8 × 8 × 8 is used in the histogram representation. This
resolution (512 quantization levels) is commonly found in the literature, and is of the same
order of magnitude as (and compares favorably with) the size of the GMM representation.
The resulting curves are presented in Figure 7. Each plot is an average of the results of the
10 query images in the class.
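For reference, the three baseline histogram measures can be sketched as follows (a minimal version, assuming normalized, flattened 8 × 8 × 8 histograms; the epsilon guard for empty bins is our addition):

```python
import numpy as np

def hist_distances(h1, h2, eps=1e-10):
    """Bin-to-bin baseline measures: Euclidean distance, histogram
    intersection (turned into a dissimilarity), and the discrete KL
    divergence D(h1 || h2). Inputs are flattened color histograms."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    euc = np.sqrt(((h1 - h2) ** 2).sum())
    # Histogram intersection similarity is sum(min); 1 minus it is a distance.
    hi = 1.0 - np.minimum(h1, h2).sum()
    # Discrete KL divergence; eps guards against empty bins.
    kl = np.sum(h1 * np.log((h1 + eps) / (h2 + eps)))
    return euc, hi, kl
```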
We notice the following points:
1. In most cases retrieval results are better when using color-only features (dashed black
line) and slightly worse when adding spatial features (dashed red line). This is consistent
with the earlier results shown in Figure 6, and agrees with previous works (e.g. [23]).
2. The BlobEMD framework provides results very similar to those of the GMM-KL framework:
in some cases the BlobEMD is better, and in others the GMM-KL framework performs
better. This behavior is to be expected, as the two schemes are closely related (with
the BlobEMD having the advantage of enabling model adaptation).
3. In all cases, the BlobEMD method provides better performance than histogram-based
methods.
5.3 Context-based image segmentation
In this and the following sections we focus on the model adaptation task. The challenge
of image segmentation is treated in this work as an image-pair (source-target) task: an
image is segmented differently based on the context, as reflected by the target image.
Model adaptation is performed by iterating on the GMM models of the two images,
applying a series of merging and splitting steps to the source image GMM, or to both the
source and target image GMMs, depending on the adaptation mode used. The rules for
blob merging and blob splitting are based on the BlobEMD flow matrix, as defined in
Table 1. In the experiments presented in this section the following thresholds were used.
Merging rule thresholds: CF = 0.2, CS1 = 0.75 and Cflow1 = 0.6. Splitting rule thresholds:
CF = 0.2, CS2 = 0.75 and Cflow2 = 0.01. The thresholds were selected heuristically based
on experimentation.
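Schematically, the adaptation alternates BlobEMD computation with rule-driven merges and splits. The sketch below shows only the control flow; `should_merge` and `should_split` stand in for the threshold rules of Table 1, and the `flow[i][j]` convention (mass moved from source blob i to target blob j) is an assumption for illustration:

```python
def adapt_source_model(source_blobs, target_blobs, blob_emd,
                       should_merge, should_split, merge, split,
                       max_iters=10):
    """Iterative source-to-target model adaptation (schematic).

    blob_emd returns (distance, flow), where flow[i][j] is the mass
    moved from source blob i to target blob j. should_merge /
    should_split encapsulate the threshold rules (C_F, C_S, C_flow)
    of Table 1; merge / split return the updated blob list.
    """
    dist = None
    for _ in range(max_iters):
        dist, flow = blob_emd(source_blobs, target_blobs)
        pair = should_merge(source_blobs, target_blobs, flow)
        if pair is not None:
            source_blobs = merge(source_blobs, *pair)
            continue
        blob = should_split(source_blobs, target_blobs, flow)
        if blob is not None:
            source_blobs = split(source_blobs, blob, flow)
            continue
        break  # no rule fires: the representation has converged
    return source_blobs, dist
```

Mutual adaptation (Section 5.4) would apply the same loop to the target model as well.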
In Figure 8 we illustrate context-based image model adaptation for adaptive segmentation
and image-pair matching on the Lake image example. In this example, similar semantic
content ("trees next to a lake") is represented by a different number of regions and region
colors (a). The treetops are separate in one image and merged in the other, while the lake
appears as separate blobs in one image and as a single blob in the other. The initial source
and target image models are shown in (b), with the corresponding flow matrix shown in (c).
The updated source and target image segmentation maps, image models and corresponding
flow matrix are shown in (d), (e) and (f), respectively. Note the resemblance of the two
updated image models in (e) vs. the initial representation in (b). The context-based model
update also changes the image distance: in this example, the BlobEMD distance is 0.08
for the initial representation and 0.04 for the final representation, a decrease of 50%
achieved via the update process.
A second example is shown in Figure 9. In (a) we show two similar images of a red car.
Due to different segmentation processes, they result in very different segmentations, as shown
in (b) and (c), top. The corresponding GMM models are also significantly different, as shown
in (b) and (c), bottom. The final modeling and segmentation results of the model adaptation
process (source-to-target adaptation) are shown in (d). The region-correspondence
process, along with merging and splitting, provides an updated model whose segmentation
is very similar between the two images (compare (b) and (d)). Note also that the model
adaptation results in smoother regions and similar-looking object (car) silhouettes.
5.4 Matching dithered images
Dithered images are images with reduced resolution in color space: due to limitations
of the display or printing device, or because of a compression process, only a limited set
of discrete colors is used. The perceived color relies on our ability to blend a mixture of
sometimes very different colors into coherent colors that are not in the given color set,
as in the example of the Monkey in Figure 10(a).
When a dithered image is modeled using only the limited set of colors, the resulting model
is very different from the model of the original, non-dithered image. Classical techniques
such as histograms fail to identify the similarity of the two models. Using the BlobEMD
framework, we can adapt the dithered image representation according to the target image
representation and enable a comparison between them.
The following algorithm characteristics apply for dithered images. The similarity in feature
space, GDF, is the Fréchet ground distance on the (x, y) space only. We do not use the
color information for the ground distance here, since the distance between dithered image
colors and their original image colors may be large, while the mixture of the dithered colors
may closely resemble the desired color at that location. The merging process in the color
feature space is thus critical in this application domain. The criterion for the merging process
is in the spatial domain: the blobs to be merged overlap in space (two colored blobs in
the dithered image overlap and flow to the same blob in the target image). The following
thresholds are used. Merging rule thresholds: CF = 1.0, CS1 = 0.6 and Cflow1 = 0.6.
Splitting rule thresholds: CF = 0.2, CS2 = 0.75 and Cflow2 = 0.01. The thresholds were
selected heuristically based on experimentation.
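The spatial-overlap merge criterion can be sketched as follows (a schematic only: the exact rules are given in Table 1, and the interpretation of the flow threshold as a dominant-flow fraction is our assumption):

```python
import numpy as np

def dither_merge_candidates(flow, overlaps, c_flow=0.6):
    """Pairs of source blobs that (a) overlap in space and (b) send
    most of their mass to the same target blob: the merge criterion
    used for dithered images (schematic).

    flow[i][j]: mass moved from source blob i to target blob j.
    overlaps(i, j): True if source blobs i and j overlap spatially.
    """
    flow = np.asarray(flow, dtype=float)
    main = flow.argmax(axis=1)  # dominant target blob per source blob
    frac = flow.max(axis=1) / np.maximum(flow.sum(axis=1), 1e-12)
    pairs = []
    n = flow.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if (main[i] == main[j] and frac[i] > c_flow
                    and frac[j] > c_flow and overlaps(i, j)):
                pairs.append((i, j))
    return pairs
```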
Figure 10 shows an example of comparing a target image (top left) with a dithered
version (27 colors) used as the source (query) image (top right). A zoom-in window is shown
and clearly demonstrates the differences between the two input images. Source-to-target
adaptation is used. The initial models extracted for the two images are shown at the
bottom of (b) and (c), with the corresponding segmentation maps in (b) and (c), top. The
differences between the images are again evident in their respective models. The BlobEMD
framework enables a model adaptation process whose final updated model fits the target
model both in color and in spatial layout (d). Note the strong resemblance between the
models of (b) and (d), especially as contrasted with (c).
A second example is presented in Figure 11. The target image is shown top left, and a
dithered version (27 colors), used as the query image, is shown top right. An extension to
mutual adaptation is shown in (d): here the target image model is adapted as well, for a final
result that is a more compact representation of both the source and target images. The updated
representation again changes the image distance. In this example, the BlobEMD distance
in the initial representation phase is 0.1; following source-to-target adaptation it reduces
to 0.05, and in the final mutual adaptation stage it is 0.036. A decrease of more than 50%
in the distance is achieved via the update process.
6 Discussion
In this work we present the BlobEMD framework for a simultaneous solution to both the
image region correspondence problem and the estimation of an image-pair distance. This
coupling allows for context-based model adaptation, where the representation of one image
is adjusted based on the representation of a second image, the context.
We present a different approach to the image segmentation problem. Rather than
trying to estimate the "true" segmentation of an image, the BlobEMD framework provides
context-dependent image segmentation: the segmentation problem is treated in conjunction
with the image matching problem, and an image may be segmented differently depending
on the target image it is compared to. Context-based image segmentation and image
matching are enabled via the EMD flow.
In the BlobEMD framework the image is represented in the continuous domain using
GMM statistical modeling. The EMD optimization enables matching of individual model
components (Gaussians, or blobs) while providing an overall distance measure between the
image distributions. There are interesting distinctions from earlier work: the image is
represented via a continuous, probabilistic representation as opposed to the well-known
discrete histogram representation, and global image matching is achieved along with a
correspondence mapping of the individual representation components. This mapping is not
available in global matching techniques such as the recently proposed GMM-KL framework.
A comparison between the BlobEMD and GMM-KL methods was presented in the
experimentation section. The results demonstrate a strong correlation between the
performance of the two approaches. The two approaches share the same representation
of the image space and differ in the distance measures used for image matching. The
GMM-KL framework is a continuous probabilistic framework throughout, with the
continuous KL distance measure used for statistically comparing two GMM distributions.
The BlobEMD framework provides the global distance measure along with an insight into
the correspondences found between individual mixture components, or image regions. This
mapping is essential for model adaptation and for any other application that relies
on region correspondences. The price paid for this inside view is a slight decrease in the
accuracy of the global distance measure.
An open theoretical issue for investigation is the definition of an appropriate ground
distance for Gaussian, or blob, comparison. Both the KL distance and the Fréchet
distance are defined for equal-weight Gaussians. A challenge remains to find a more exact
mathematical formalism for the comparison between nonequal-weight Gaussians, as is the
case at hand.
Using the BlobEMD framework, we solve the region correspondence problem across an
image pair. The correspondences between the regions in the two images are used to merge
and split the regions so that they still represent the images well, while at the same time
bringing the two representations to a common context. The suggested framework provides
image representations that are more uniform and better aligned between the two images
to be matched.
We view this work as a first step in an extensive research effort ahead, in which we
augment the region representation vector to include features such as texture, size and shape,
in addition to the color feature chosen here. A definition of a hierarchical matching framework
is under way. Region correspondences based on low-level features such as color and texture
may provide a semantically plausible image segmentation, thus enabling the extension of
the feature space to include higher-level, more semantic region characteristics, such as
region sizes and shapes. In Figure 9 we see that the model adaptation results
in smoother regions and similar-looking object (car) silhouettes. The BlobEMD methodology
may provide the means for the much desired transition from regions to silhouettes and
shapes.
A Fréchet ground distance
The Fréchet distance is a special case of the Monge-Kantorovich mass transference problem
[18], which is the basis of the EMD. The general Monge-Kantorovich problem is defined as
\inf \left\{ \int_{U \times U} c(s,t)\, P(ds,dt) \;:\; P \in \mathcal{P}(P_1, P_2) \right\},   (8)
where P1 and P2 are two Borel probability measures given on a separable metric space (U, d),
and P(P1, P2) is the space of all Borel probability measures P on U × U with fixed marginals
P1(·) = P(· × U) and P2(·) = P(U × ·). P1 and P2 are the initial and final distributions,
and P is the optimal transference plan, or the flow as we use it in this work. c(s, t) is the
cost function, which in our work is the Euclidean distance.
The Fréchet distance [6] solves the general Monge-Kantorovich problem for the case where
s and t are normal distributions with means µs, µt, and covariance matrices Σs, Σt,
respectively:
d^2(s,t) = |\mu_s - \mu_t|^2 + \mathrm{tr}\left[\Sigma_s + \Sigma_t - 2\,(\Sigma_s \Sigma_t)^{1/2}\right].   (9)
It is a closed-form solution to the EMD in the case of two equal-weight Gaussians and is
a natural distance for the Gaussian blob representation. Unfortunately, when two Gaussian
blobs have different weights, the Fréchet distance is not valid. An extension to the non-equal
weights case is yet to be investigated.
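Eq. (9) can be evaluated numerically with a symmetric PSD square root. The sketch below (function names are ours) uses the equivalent symmetric form tr[(Σ_t^{1/2} Σ_s Σ_t^{1/2})^{1/2}], which keeps the intermediate matrix symmetric:

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric square root of a positive semi-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance_sq(mu_s, sigma_s, mu_t, sigma_t):
    """Squared Fréchet distance between N(mu_s, sigma_s) and
    N(mu_t, sigma_t), Eq. (9)."""
    rt = psd_sqrt(sigma_t)
    cross = psd_sqrt(rt @ sigma_s @ rt)  # equals (sigma_s sigma_t)^{1/2} in trace
    return float(np.sum((mu_s - mu_t) ** 2)
                 + np.trace(sigma_s + sigma_t - 2.0 * cross))
```

For the dithered-image experiments of Section 5.4 the same formula would be applied to the (x, y) components of the means and covariances only.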
B Merging blob statistics
Let bi = (wi, µi, Σi) and bj = (wj, µj, Σj) be two blobs to be merged, where wi, wj are the
weights, µi, µj the means, and Σi, Σj the covariance matrices of the blobs. We look for
the blob b = (w, µ, Σ) that represents the statistics of the union of the two sets of pixels
represented by the two blobs.
Let n_i and n_j be the number of pixels represented by blobs b_i and b_j respectively. We have

\mu_i = \frac{1}{n_i} \sum_{p \in b_i} p \,, \qquad \mu_j = \frac{1}{n_j} \sum_{p \in b_j} p \,.
Combining the two sets of pixels b_i ∪ b_j, we get the combined mean

\mu = \frac{1}{n_i + n_j} \sum_{p \in b_i \cup b_j} p
    = \frac{1}{n_i + n_j} \Big( \sum_{p \in b_i} p + \sum_{p \in b_j} p \Big)
    = \frac{1}{n_i + n_j} \left( n_i \mu_i + n_j \mu_j \right)
    = \frac{n_i/n}{(n_i + n_j)/n}\, \mu_i + \frac{n_j/n}{(n_i + n_j)/n}\, \mu_j
    = \frac{w_i}{w}\, \mu_i + \frac{w_j}{w}\, \mu_j \,,

where w = w_i + w_j.
Similarly, for the covariance matrices we have

\Sigma_i = \frac{1}{n_i} \sum_{p \in b_i} p p^t - \mu_i \mu_i^t \,, \qquad
\Sigma_j = \frac{1}{n_j} \sum_{p \in b_j} p p^t - \mu_j \mu_j^t \,.

Combining the two sets of pixels b_i ∪ b_j, we get the combined covariance

\Sigma = \frac{1}{n_i + n_j} \sum_{p \in b_i \cup b_j} p p^t - \mu \mu^t
       = \frac{1}{n_i + n_j} \Big( \sum_{p \in b_i} p p^t + \sum_{p \in b_j} p p^t \Big) - \mu \mu^t
       = \frac{1}{n_i + n_j} \big( n_i (\Sigma_i + \mu_i \mu_i^t) + n_j (\Sigma_j + \mu_j \mu_j^t) \big) - \mu \mu^t
       = \frac{n_i/n}{(n_i + n_j)/n} (\Sigma_i + \mu_i \mu_i^t) + \frac{n_j/n}{(n_i + n_j)/n} (\Sigma_j + \mu_j \mu_j^t) - \mu \mu^t
       = \frac{w_i}{w} (\Sigma_i + \mu_i \mu_i^t) + \frac{w_j}{w} (\Sigma_j + \mu_j \mu_j^t) - \mu \mu^t \,.
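The closed-form merge above translates directly into code (a minimal sketch; the blob tuple layout (w, µ, Σ) is ours):

```python
import numpy as np

def merge_blobs(blob_i, blob_j):
    """Merge two Gaussian blobs (w, mu, Sigma) into the blob that
    represents the statistics of the union of their pixel sets."""
    w_i, mu_i, s_i = blob_i
    w_j, mu_j, s_j = blob_j
    w = w_i + w_j
    # Weighted mean (derivation above).
    mu = (w_i * mu_i + w_j * mu_j) / w
    # Weighted second moments, then subtract the merged mean's outer product.
    s = (w_i * (s_i + np.outer(mu_i, mu_i))
         + w_j * (s_j + np.outer(mu_j, mu_j))) / w - np.outer(mu, mu)
    return w, mu, s
```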
References
[1] J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Jain, and C.-F. Shu. Virage image search engine: An open framework for image management. In R. Jain (ed.), Symposium on Electronic Imaging: Science and Technology - Storage and Retrieval for Image and Video Databases IV, volume IV, pages 76–87, 1996.

[2] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In Proc. of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CVPR'97), pages 42–49, 1997.

[3] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:1026–1038, August 2002.

[4] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.

[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. B, 39(1):1–38, 1977.

[6] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, volume 12, 1982.

[7] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.

[8] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, et al. Query by image and video content: The QBIC system. IEEE Computer, 28(9):23–32, 1995.

[9] H. Greenspan, G. Dvir, and Y. Rubner. Region correspondence for image matching via EMD flow. In Proceedings of the CVPR 2000 Workshop on Content-Based Access of Image and Video Libraries, pages 27–31, 2000.

[10] H. Greenspan, J. Goldberger, and L. Ridel. A continuous probabilistic framework for image matching. Computer Vision and Image Understanding, 84:384–406, December 2001.

[11] J. Hafner, H. Sawhney, W. Equitz, M. Flickner, and W. Niblack. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(7):729–739, 1995.

[12] F. L. Hitchcock. The distribution of a product from several sources to numerous localities. J. Math. Phys., 20:224–230, 1941.

[13] J. Huang, S. R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih. Image indexing using color correlograms. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pages 762–768, 1997.

[14] S. Kullback. Information Theory and Statistics. Dover, 1968.

[15] W. Ma and B. Manjunath. NeTra: A toolbox for navigating large image databases. In Proceedings of the IEEE Int. Conf. on Image Processing, pages 568–571, 1997.

[16] G. Pass and R. Zabih. Comparing images using joint histograms. Multimedia Systems, 7:234–240, 1999.

[17] J. Puzicha, Y. Rubner, C. Tomasi, and J. M. Buhmann. Empirical evaluation of dissimilarity measures for color and texture. In IEEE International Conference on Computer Vision, pages 1165–1172, 1999.

[18] S. T. Rachev. The Monge-Kantorovich mass transference problem and its stochastic applications. Theory of Probability and its Applications, XXIX(4):647–676, 1984.

[19] Y. Rubner and C. Tomasi. Perceptual Metrics for Image Database Navigation. Kluwer Academic Publishers, Boston, December 2000.

[20] J. R. Smith. Integrated Spatial and Feature Image Systems: Retrieval, Analysis and Compression. PhD thesis, Columbia University, 1997.

[21] J. R. Smith and S.-F. Chang. Integrated spatial and feature image query. Multimedia Systems, 7:129–140, 1999.

[22] M. Stricker and A. Dimai. Spectral covariance and fuzzy regions for image indexing. Machine Vision and Applications, 10(2):66–73, 1997.

[23] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.

[24] G. Wyszecki and W. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.
Figure 4: Synthetic example of source-to-target splitting process. (a) The source image (left)is matched to the target image (right); (b) Initial image models (representation layer) andtheir flow matrix from source to target; (c) Image models after splitting; (d) Final imagemodels after merging, following source model adaptation.
Figure 5: Model adaptation flow chart
Figure 6: Statistical analysis of intra-class distances (black) vs. inter-class distances (white).(a) (L,a,b) feature space; (b) (L,a,b,x,y) feature space. The x-axis is the BlobEMD distanceand the y-axis is the frequency of occurrence of the respective distance in each of the twofeature spaces.
[Six PR plots, one per class: field, snow, car (top row); desert, monkey, waterfall (bottom row). Each plot shows Precision (y-axis) vs. Recall (x-axis) on a 0 to 1 scale.]
Figure 7: Precision vs. Recall. 315 images in database. Each plot is an average of the resultsof the 10 query images in the class. In dashed colors are the BlobEMD results. Dashed blackis results of color only. Dashed red is results of color and x, y. Solid lines are for comparison.In black is the PR curve of the GMM-KL framework. The purple, red and green curvescorrespond to histogram representation and Euc., H. I., and Disc. KL distance measures,respectively.
Figure 8: Context-based image representation and matching via BlobEMD. (a) An imagepair example; (b) Source and target image models; (c) Corresponding flow matrix; (d)Updated source image segmentation map and target image segmentation map; (e) Updatedsource and target image models; (f) Updated flow matrix.
Figure 9: Context-based model adaptation for segmentation. (a) Input images; (b) Tar-get image: segmentation map (top) and GMM representation (bottom); (c) Source image:segmentation map (top) and GMM representation (bottom); (d) Source image after context-based segmentation: adapted segmentation map (top) and adapted GMM representation(bottom).
Figure 10: Context-based model adaptation for dithered image representation. (a) Targetimage (left) and a dithered version (27 colors only) as a query image (right). A zoom inwindow is shown bottom; (b) and (c) Target and source image models are shown bottom,with the corresponding segmentation maps shown on top; (d) Final updated model usingsource-to-target adaptation.
Figure 11: Context-based model adaptation for dithered image representation. (a) Originalimages; (b) Initial images models; (c) Source model adaptation according to target model(source-to-target adaptation); (d) Mutual adaptation of both source and target models.