Statistical modeling and many-to-many matching for view-based 3D object retrieval

ARTICLE IN PRESS

Signal Processing: Image Communication 25 (2010) 18–27

journal homepage: www.elsevier.com/locate/image

Fei Li, Qionghai Dai *, Wenli Xu, Guihua Er

Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China

* Corresponding author.

E-mail addresses: [email protected] (F. Li), qhdai@mail.tsinghua.edu.cn (Q. Dai), [email protected] (W. Xu), [email protected] (G. Er).

0923-5965/$ - see front matter © 2009 Elsevier B.V. All rights reserved.

doi:10.1016/j.image.2009.11.001

Article info

Article history: Received 26 September 2008; received in revised form 17 October 2009; accepted 18 November 2009

Keywords: View-based 3D object retrieval; Statistical modeling; Markov chain; Many-to-many matching; Earth mover’s distance

Abstract

We address the task of view-based 3D object retrieval, in which each object is represented by a set of views taken from different positions rather than by a geometrical model based on polygonal meshes. As the number of views and the view point setting cannot always be the same for different objects, the retrieval task is more challenging, and the existing methods for 3D model retrieval become infeasible. In this paper, the information in the sets of views is exploited from two aspects. On the one hand, the form of the histogram is converted from a vector to a state sequence, and a Markov chain (MC) is used to model the statistical characteristics of all the views representing the same object. On the other hand, the earth mover’s distance (EMD) is used to achieve many-to-many matching between two sets of views. For 3D object retrieval, by combining the above two aspects, a new distance measure is defined, and a novel approach to automatically determine the edge weights in graph-based semi-supervised learning is proposed. Experimental results on different databases demonstrate the effectiveness of our proposal.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

With the rapid development of 3D rendering and visualization technologies, a large number of digital 3D objects are distributed widely on the Internet. This makes it an urgent need to organize and search the 3D objects effectively and efficiently. The traditional text-based methods are impractical because of the large labeling cost and the subjectivity of human perception. Therefore, content-based 3D object retrieval [1,2] has become an active research topic, and has been widely explored in recent years.

According to the generation schemes, a 3D object can be represented by either a geometrical model based on polygonal meshes or some views taken by a multi-camera array. For 3D model retrieval, many descriptors have been proposed, and they can generally be classified into four categories [3]: primitive-based, statistics-based, geometry-based, and view-based. All the descriptors have their own advantages and limitations. As far as the retrieval performance is concerned, according to many experimental results [3,4], the view-based methods, in which a set of 2D views is adopted for 3D model representation, often demonstrate their superiority to others.

In this paper, we address view-based 3D object retrieval, in which each object is represented by a set of views taken from different positions. The task seems quite similar to the aforementioned 3D model retrieval using view-based descriptors, but there are significant differences between the two. If the geometrical model of a 3D object is given, the exact view taken at an arbitrary position can be acquired. However, if the original representation of a 3D object only consists of a group of views, much less information is available, and the retrieval task is more challenging.

Let us take the 3D model retrieval system based on the light field descriptor (LFD) [5,6] as an example. Its basic idea is that if two 3D models are similar, they also look similar from all viewing angles. An LFD includes 10 silhouettes of a 3D model captured from the vertices of a dodecahedron over a hemisphere. Each 3D model is described by 10 LFDs, which are created from different camera system orientations. Then the distance between two models is determined by the minimal distance obtained by rotating the viewing sphere of the LFDs of one object with respect to the other’s. That is to say, the basic idea of defining the distance is to find the one-to-one view matching for two objects. In 3D model retrieval, since we can obtain the needed silhouettes from the model, the best matching can always be achieved and the defined distance is often very effective. However, in the case of view-based 3D object retrieval, the two sets of views corresponding to two objects may be captured from different view point settings, and even the numbers of views in the two sets may differ. Therefore, the corresponding view pairs cannot always be found for two objects, and the method in [5] is no longer feasible.

In some recent methods, the image sets are modeled by nonlinear manifolds, and the distance between two view-based 3D objects can be determined by the distance between the two corresponding manifolds. Based on the analysis of the tangent subspace, a distance is defined as the combination of the distances in both high- and low-dimensional spaces [7]. As each set of views should be mapped into all the nonlinear subspaces spanned by each of the other sets, the computational load is quite heavy. In another method, local linear subspaces are first constructed from each manifold, then the distance between two manifolds is determined by the closest subspace pair [8]. Its main limitation is that a large amount of data is needed to effectively calculate the geodesic distance, which is impractical in some cases.

In this paper, we present a combined framework for view-based 3D object retrieval. In order to implement an effective retrieval system, the following two kinds of information should be taken into account: the overall characteristics of each object, and the relationships among the views of different objects. To exploit the information in the sets of views, statistical modeling and many-to-many matching are introduced in our proposal. Based on the quasi-histogram [9], the form of the histogram is converted from a vector to a state sequence, and a Markov chain (MC) is utilized for modeling the set of views corresponding to the same object as a whole. Since finding the one-to-one view matching for two objects is infeasible, the relationship of many-to-many matching is introduced, and the earth mover’s distance (EMD) [10] between two sets of views is calculated. As the methods based on MC and EMD exploit the histogram feature in different ways, it is more effective to combine them. Two retrieval algorithms, namely nearest neighbor and graph-based semi-supervised learning, are considered in this paper. For the former, we define a new distance measure between two view-based 3D objects; for the latter, we propose a robust approach to determine the edge weights. Experiments on different databases are conducted to compare the performances of our proposal and the existing methods.

The rest of the paper is organized as follows: Section 2 describes the combined distance measure. Section 3 describes how to determine the edge weights in graph-based semi-supervised learning. Our experimental results are presented in Section 4, which is followed by conclusions and an analysis of future work in Section 5.

2. Nearest neighbor for 3D object retrieval

The nearest neighbor algorithm is a simple and direct method for 3D object retrieval. Given a query and a distance measure, the distances between the query and all the objects in the database are calculated, and the objects with the smallest distances are returned as the retrieval results. Obviously, the adopted distance measure has a direct impact on the final retrieval performance. In this section, a combined distance measure between two view-based 3D objects is proposed. It considers two kinds of information: one is acquired from statistical modeling and the other from many-to-many matching.
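The retrieval step itself can be sketched in a few lines. The following is only an illustration: the function name `retrieve_nearest` and the `distance` callable are hypothetical, and any of the distance measures defined in this section could be plugged in.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def retrieve_nearest(query: T,
                     database: Sequence[T],
                     distance: Callable[[T, T], float],
                     top_k: int = 10) -> List[int]:
    """Return the indices of the top_k database objects closest to the query."""
    # Distance between the query and every object in the database.
    dists = [distance(query, obj) for obj in database]
    # Rank the objects by increasing distance and keep the first top_k.
    order = sorted(range(len(database)), key=lambda i: dists[i])
    return order[:top_k]
```

With a toy scalar "database" and the absolute difference as the distance, `retrieve_nearest(1.2, [0.0, 5.0, 1.0, 9.0], lambda a, b: abs(a - b), top_k=2)` ranks index 2 first and index 0 second.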

2.1. Distance measure based on statistical modeling

2.1.1. Quasi-histogram and MC

The histogram is widely used in image analysis. It can be easily calculated, and is a simple way to approximate the distribution of a random variable. Many kinds of histograms have been proposed to describe different image features such as color, texture, shape and so on. To further mine the information in the histogram, we have proposed a method to represent the histogram in another form called the quasi-histogram, which can be thought of as a state sequence of an MC [9]. The basic idea of the method is briefly described below.

Let the $M$-bin histogram of an image be denoted as $H = [H(1), H(2), \ldots, H(M)]^T$. Then its corresponding quasi-histogram $H^q = \{H^q(1), H^q(2), \ldots, H^q(K_H)\}$ is defined as

\[ H^q(k) = \min\left\{ m : \tilde{H}(m) \ge \frac{k}{K_H} \right\}, \quad 1 \le k \le K_H, \tag{1} \]

where $\tilde{H} = [\tilde{H}(1), \tilde{H}(2), \ldots, \tilde{H}(M)]^T$ is the cumulated histogram, whose element is calculated as $\tilde{H}(m) = \sum_{i=1}^{m} H(i)$ $(1 \le m \le M)$, and $K_H$ is the predefined length of the quasi-histogram. In general, for a given length $K_H$, the more frequently $m$ appears in $H^q$, the greater $H(m)$ is, and vice versa. Hence, $H^q$ can also reflect the distribution of a random variable to a certain extent.

Unlike traditional methods, we consider $H^q$ as a state sequence, rather than a vector. By treating each bin of the original histogram as a random variable with an exponential distribution, based on the premise that the bins are independent of each other, we have proved that $H^q$ can be characterized by an MC. Although the premise is violated to some degree, satisfactory experimental results can be obtained by modifying the parameters of the MC to take the correlations between different bins into account. For more details about the quasi-histogram, readers are referred to [9].
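As a minimal sketch of Eq. (1), the quasi-histogram can be read off the cumulated histogram with a binary search. The sketch assumes that the histogram is normalized to sum to one (the code normalizes it explicitly) and uses 1-based bin indices; the function name `quasi_histogram` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def quasi_histogram(hist, length):
    """Eq. (1): H^q(k) = min{m : cumulated(m) >= k / length}, bins numbered from 1."""
    total = float(sum(hist))
    # Cumulated histogram of the normalized bins (assumption: normalization to 1).
    cum = [c / total for c in accumulate(hist)]
    cum[-1] = 1.0  # guard against round-off in the last bin
    # bisect_left returns the first index whose cumulated value is >= the threshold.
    return [bisect_left(cum, k / length) + 1 for k in range(1, length + 1)]
```

For the histogram `[0.5, 0.25, 0.25]` and length 4, the resulting state sequence is `[1, 1, 2, 3]`: the bin holding half of the mass appears twice as often as each of the other two, which is the behavior described after Eq. (1).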

2.1.2. Distance measure based on MC

Suppose there are altogether $V$ view-based 3D objects in the database, and they are denoted by $\mathcal{L} = \{L_1, L_2, \ldots, L_V\}$. Each object $L_k$ $(1 \le k \le V)$ is represented by a set of views $\mathcal{I}_k = \{I_{k1}, I_{k2}, \ldots, I_{k,N_k}\}$, where $N_k$ is the number of views for representing $L_k$. The sets of histograms and quasi-histograms corresponding to $\mathcal{I}_k$ are $\mathcal{H}_k = \{H_{k1}, H_{k2}, \ldots, H_{k,N_k}\}$ and $\mathcal{H}^q_k = \{H^q_{k1}, H^q_{k2}, \ldots, H^q_{k,N_k}\}$, respectively. Under the assumption that the quasi-histograms of all the views corresponding to a given 3D object are stochastically generated by the same MC, we can train $V$ MCs $\lambda_1, \lambda_2, \ldots, \lambda_V$ by the method in [9]. Each model $\lambda_k$ $(1 \le k \le V)$ corresponds to the object $L_k$, and is trained by the quasi-histograms in the set $\mathcal{H}^q_k$.

After representing the view-based 3D objects by MCs, the distance between two objects can be determined by the distance between their corresponding MCs. In our proposal, the Kullback–Leibler divergence (KLD) [11] is adopted. If the lengths of all the state sequences generated by an MC are $T$, the KLD between two MCs $\lambda_A$ and $\lambda_B$ is defined by

\[ d^{KLD}(\lambda_A, \lambda_B) = \sum_{O \in S^T} P(O|\lambda_A) \log \frac{P(O|\lambda_A)}{P(O|\lambda_B)}, \tag{2} \]

where $O$ is a state sequence, $S$ is the state space of the MC, and $S^T$ denotes the set of all the state sequences of length $T$.

Obviously, the computational load of enumerating all the elements in $S^T$ is quite heavy. To deal with the problem, the Monte-Carlo method is usually adopted. After independently generating $n$ state sequences $O_1, O_2, \ldots, O_n$, the KLD can be calculated approximately as

\[ d^{KLD}(\lambda_A, \lambda_B) \approx \frac{1}{n} \sum_{i=1}^{n} \log \frac{P(O_i|\lambda_A)}{P(O_i|\lambda_B)} = \frac{1}{n} \sum_{i=1}^{n} \left[ \log(P(O_i|\lambda_A)) - \log(P(O_i|\lambda_B)) \right]. \tag{3} \]

In our proposal, the symmetric KLD is used, and the distance between two objects $L_A$ and $L_B$ based on MC is defined as

\[ d^{MC}(L_A, L_B) = \frac{1}{2} \left( d^{KLD}(\lambda_A, \lambda_B) + d^{KLD}(\lambda_B, \lambda_A) \right). \tag{4} \]

By using the quasi-histogram and MC, each view-based 3D object is treated as a whole, and the statistical characteristics of all the views representing the same object can be well exploited.
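Eqs. (3) and (4) can be sketched as follows. This is only an illustration under stated assumptions: each MC is taken to be a first-order homogeneous chain given as a pair `(pi, A)` of an initial distribution and a transition matrix with strictly positive entries (the training procedure of [9] is not reproduced here), and the sequences used to estimate $d^{KLD}(\lambda_A, \lambda_B)$ are drawn from $\lambda_A$, as the Monte-Carlo estimate of Eq. (2) requires. All function names are hypothetical.

```python
import math
import random


def sample_sequence(pi, A, T, rng):
    """Draw one length-T state sequence from the chain (pi, A)."""
    states = rng.choices(range(len(pi)), weights=pi)
    for _ in range(T - 1):
        states += rng.choices(range(len(pi)), weights=A[states[-1]])
    return states


def log_prob(seq, pi, A):
    """Log-likelihood of a state sequence under the chain (pi, A)."""
    lp = math.log(pi[seq[0]])
    for s, t in zip(seq, seq[1:]):
        lp += math.log(A[s][t])
    return lp


def kld_monte_carlo(chain_a, chain_b, T, n, rng):
    """Eq. (3): average log-likelihood ratio over n sequences drawn from chain_a."""
    total = 0.0
    for _ in range(n):
        seq = sample_sequence(chain_a[0], chain_a[1], T, rng)
        total += log_prob(seq, *chain_a) - log_prob(seq, *chain_b)
    return total / n


def d_mc(chain_a, chain_b, T=12, n=2000, seed=0):
    """Eq. (4): symmetric KLD between two chains."""
    rng = random.Random(seed)
    return 0.5 * (kld_monte_carlo(chain_a, chain_b, T, n, rng)
                  + kld_monte_carlo(chain_b, chain_a, T, n, rng))
```

The default `T=12` only mirrors the quasi-histogram length used later in the experiments; `n` and `seed` are illustrative values.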

2.2. Distance measure based on many-to-many matching

2.2.1. Basic idea of EMD

Based on the minimal cost to transform one distribution into another, EMD was originally proposed to determine the distance between two distributions [10]. It introduces many-to-many matching, and can deal with variable-length representations. Therefore, EMD often performs well, and has been widely used in region-based image retrieval (RBIR) and other fields [12,13].

Formally, suppose two signatures $A$ and $B$ are represented by $\{(r_{A1}, v_{A1}), (r_{A2}, v_{A2}), \ldots, (r_{A,N_A}, v_{A,N_A})\}$ and $\{(r_{B1}, v_{B1}), (r_{B2}, v_{B2}), \ldots, (r_{B,N_B}, v_{B,N_B})\}$, respectively, where $N_A$ and $N_B$ are the numbers of clusters in $A$ and $B$, $r_{Ai}$ $(1 \le i \le N_A)$ and $r_{Bj}$ $(1 \le j \le N_B)$ are the cluster representatives, and $v_{Ai}$ and $v_{Bj}$ are the weights of the corresponding clusters. Let the distance between $r_{Ai}$ and $r_{Bj}$ be $d(i,j)$. According to the definition in [10], the EMD between the two signatures can be calculated as

\[ d^{EMD}(A, B) = \min_{f(i,j)} \frac{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} f(i,j)\, d(i,j)}{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} f(i,j)}, \tag{5} \]

where $f(i,j)$ denotes the flow between $r_{Ai}$ and $r_{Bj}$, subject to the following constraints:

\[ f(i,j) \ge 0, \quad 1 \le i \le N_A,\ 1 \le j \le N_B, \]

\[ \sum_{j=1}^{N_B} f(i,j) \le v_{Ai}, \quad 1 \le i \le N_A, \]

\[ \sum_{i=1}^{N_A} f(i,j) \le v_{Bj}, \quad 1 \le j \le N_B, \]

\[ \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} f(i,j) = \min\left( \sum_{i=1}^{N_A} v_{Ai}, \sum_{j=1}^{N_B} v_{Bj} \right). \]

According to the last constraint, the denominator of the cost function in Eq. (5) is a constant. Therefore, Eq. (5) is a linear programming problem, and can be solved by many existing algorithms.

2.2.2. Distance measure based on EMD

In view-based 3D object retrieval, the numbers of views corresponding to different objects may be different, so it is impossible for all the objects to have view-based representations of the same length. As EMD has the advantage of dealing with variable-length representations, it is adopted in our proposal to measure the distance between two 3D objects.

For each view, a weight is introduced. Then the view $I_{kt}$ $(1 \le k \le V,\ 1 \le t \le N_k)$ can be represented by $(r_{kt}, v_{kt})$, where $r_{kt}$ is the visual feature of the view, and $v_{kt}$ is the corresponding weight. Thus each 3D object $L_k$ $(1 \le k \le V)$ can be represented by $\{(r_{k1}, v_{k1}), (r_{k2}, v_{k2}), \ldots, (r_{k,N_k}, v_{k,N_k})\}$, which is in the form of a signature. In our proposal, a normalization constraint is imposed so that for $1 \le k \le V$, the equation $\sum_{t=1}^{N_k} v_{kt} = 1$ always holds. Considering the constraint, if the distance between $r_{Ai}$ and $r_{Bj}$ is denoted by $d(i,j)$, the EMD between the two 3D objects $L_A$ and $L_B$ can be calculated as

\[ d^{EMD}(L_A, L_B) = \min_{f(i,j)} \left[ \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} f(i,j)\, d(i,j) \right], \tag{6} \]

with the constraints

\[ f(i,j) \ge 0, \quad 1 \le i \le N_A,\ 1 \le j \le N_B, \]

\[ \sum_{j=1}^{N_B} f(i,j) = v_{Ai}, \quad 1 \le i \le N_A, \]

\[ \sum_{i=1}^{N_A} f(i,j) = v_{Bj}, \quad 1 \le j \le N_B. \]

By introducing many-to-many matching automatically, EMD can be used to determine the distance between two 3D objects effectively and robustly.

In our proposal, the histogram $H_{kt}$ $(1 \le k \le V,\ 1 \le t \le N_k)$ is used as the feature of each view. As it is often difficult to determine which view is more important for 3D object representation, the same weight is assigned to every view used for representing the same object. That is to say, we simply set $v_{kt} = 1/N_k$ $(1 \le k \le V,\ 1 \le t \le N_k)$. If prior knowledge about the importance of each view is available, $v_{kt}$ can be set to other values.

If the number of views to represent a 3D object is large, the computational load for calculating EMD is quite heavy. To deal with the problem, we use a suitable clustering algorithm to select characteristic views for 3D object representation. The existing algorithms in 3D model retrieval using view-based descriptors, such as adaptive views clustering (AVC) [14], can be adopted. However, as the calculated EMD is robust to the clustering results, simpler methods can be used. In our proposal, hierarchical agglomerative clustering is adopted, and all the views whose similarities are greater than a given threshold are treated as a whole.
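The linear program of Eq. (6) can be sketched with an off-the-shelf solver. The sketch below uses `scipy.optimize.linprog`; it assumes that the ground distance $d(i,j)$ between two view histograms is Euclidean (the text does not fix this choice) and defaults to the uniform weights $v_{kt} = 1/N_k$. The function name `emd_between_objects` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def emd_between_objects(feats_a, feats_b, weights_a=None, weights_b=None):
    """Eq. (6): EMD between two objects given per-view feature vectors."""
    A = np.asarray(feats_a, dtype=float)
    B = np.asarray(feats_b, dtype=float)
    na, nb = len(A), len(B)
    # Uniform view weights unless prior knowledge is supplied.
    va = np.full(na, 1.0 / na) if weights_a is None else np.asarray(weights_a, dtype=float)
    vb = np.full(nb, 1.0 / nb) if weights_b is None else np.asarray(weights_b, dtype=float)
    # Ground distances d(i, j); the Euclidean distance is an assumption.
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Equality constraints on the flattened flow f[i * nb + j].
    A_eq = np.zeros((na + nb, na * nb))
    for i in range(na):
        A_eq[i, i * nb:(i + 1) * nb] = 1.0   # sum_j f(i, j) = v_Ai
    for j in range(nb):
        A_eq[na + j, j::nb] = 1.0            # sum_i f(i, j) = v_Bj
    b_eq = np.concatenate([va, vb])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)
```

Because the two objects may have different numbers of views, `feats_a` and `feats_b` may have different numbers of rows; only the feature dimension must agree.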

2.3. Combined distance measure

In the distance measures defined in Sections 2.1.2 and 2.2.2, the histogram is treated as a state sequence and a vector, respectively. As the two methods exploit the histogram information in different ways, better performance can be expected by combining their results. There are many ways to combine them; linear combination is adopted in our proposal, which is simple but has proved effective in the experiments.

The final distance measure between two view-based 3D objects is defined as

\[ d^{Com}(L_A, L_B) = \alpha \frac{d^{MC}(L_A, L_B)}{Z_1} + (1 - \alpha) \frac{d^{EMD}(L_A, L_B)}{Z_2}, \tag{7} \]

where $Z_1$ and $Z_2$ are normalization constants that ensure the values of $d^{MC}(L_A, L_B)$ and $d^{EMD}(L_A, L_B)$ are between 0 and 1, and $\alpha$ $(0 \le \alpha \le 1)$ is a tunable parameter reflecting our confidence in statistical modeling and many-to-many matching.
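A short sketch of Eq. (7) on precomputed distance matrices follows. The choice of $Z_1$ and $Z_2$ as the largest entry of each matrix is an assumption made here for illustration; the text only requires that the normalized distances lie between 0 and 1. The function name `combined_distance` is hypothetical.

```python
import numpy as np


def combined_distance(d_mc, d_emd, alpha):
    """Eq. (7): linear combination of normalized MC and EMD distance matrices."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d_mc = np.asarray(d_mc, dtype=float)
    d_emd = np.asarray(d_emd, dtype=float)
    # Normalization constants Z1 and Z2 (assumption: maximum entry of each matrix).
    z1 = d_mc.max() if d_mc.max() > 0 else 1.0
    z2 = d_emd.max() if d_emd.max() > 0 else 1.0
    return alpha * d_mc / z1 + (1.0 - alpha) * d_emd / z2
```

Setting `alpha` to 0 or 1 recovers the purely EMD-based or purely MC-based ranking, respectively.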

3. Graph-based semi-supervised learning for 3D object retrieval

The basic idea of the nearest neighbor algorithm is simple. However, as the distance measure only considers the pairwise relationship between two objects, sometimes its effectiveness is limited. One of the promising approaches to explore the relationships of all the objects in the database is graph-based semi-supervised learning, and it has been successfully introduced into image retrieval [15,16]. The algorithm starts with a graph, in which the vertices correspond to all the labeled and unlabeled samples, and the weighted edges reflect the similarities of the vertex pairs. Then, the available information is spread via the graph, from labeled samples to unlabeled ones, and the final spread results are used for classification or ranking. As useful information can be effectively exploited from both labeled and unlabeled samples, more satisfactory performance can always be obtained.

Constructing an effective weighted graph is very important for any graph-based method. In most of the existing methods, the edge weights of the graph are calculated by a Gaussian function of the form $\exp[-d^2(L_A, L_B)/2\sigma^2]$, where $d(\cdot,\cdot)$ is a given distance measure between two objects. Therefore, we can simply use the combined distance proposed in Section 2 for graph construction. However, as pointed out in [17], the parameter $\sigma$ can influence the results significantly, and there is no reliable approach to determine it automatically. To address this problem, the idea of reconstructing each sample using a linear combination of its neighbors has been successfully adopted in [17], but it can only deal with the cases when the representations of all the samples have the same length. Hence, it cannot be directly introduced into view-based 3D object retrieval, because the numbers of views for representing different objects may be different, which makes the linear combination infeasible.

In this paper, a new method to determine the edge weights in graph-based semi-supervised learning is proposed. As in Section 2, both statistical modeling and many-to-many matching are taken into consideration. Two weighted graphs are constructed first, one based on MC and the other based on EMD; then they are combined to get the final graph.

3.1. Graph construction based on statistical modeling

Given the quasi-histogram sets corresponding to all the 3D objects and the trained MCs, the average log-likelihood of the quasi-histograms in the set $\mathcal{H}^q_k$ $(1 \le k \le V)$ generated by the MC $\lambda_c$ $(1 \le c \le V)$ is calculated as

\[ F_{kc} = \frac{1}{N_k} \sum_{i=1}^{N_k} \log(P(H^q_{ki}|\lambda_c)). \tag{8} \]

Then for each quasi-histogram set $\mathcal{H}^q_k$, a vector $F_k = [F_{k1}, F_{k2}, \ldots, F_{kV}]^T$ is obtained. It can be treated as a new descriptor for the 3D object $L_k$, and can effectively describe the relationships between the quasi-histogram set and all the statistical models.

In this way, the representations of all the 3D objects are of equal length, and the idea of linear reconstruction by neighbors in [17] can be adopted. Formally, according to the distance measure $d^{MC}$ defined in Section 2.1.2, the set of $K$ nearest neighbors of object $L_A$ is denoted as $\mathcal{K}^{MC}(L_A) = \{L_{A_1}, L_{A_2}, \ldots, L_{A_K}\}$. Let the reconstruction weight corresponding to $L_{A_k}$ $(1 \le k \le K)$ be $w^{MC}_{A,A_k}$; it can be determined by

\[ \varepsilon_A = \min_{w^{MC}_{A,A_k}} \left\| F_A - \sum_{k=1}^{K} \left( w^{MC}_{A,A_k} \cdot F_{A_k} \right) \right\|^2, \tag{9} \]

subject to

\[ w^{MC}_{A,A_k} \ge 0, \quad 1 \le k \le K, \]

\[ \sum_{k=1}^{K} w^{MC}_{A,A_k} = 1. \]
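Eq. (9) is a least-squares problem over the probability simplex. One possible way to solve it, sketched below, is a general-purpose constrained optimizer (`scipy.optimize.minimize` with the SLSQP method); the solver choice and the function name `reconstruction_weights` are assumptions for illustration, not the procedure of [17].

```python
import numpy as np
from scipy.optimize import minimize


def reconstruction_weights(f_a, neighbour_descriptors):
    """Eq. (9): non-negative weights summing to one that best reconstruct f_a."""
    f_a = np.asarray(f_a, dtype=float)
    N = np.asarray(neighbour_descriptors, dtype=float)  # shape (K, descriptor length)
    K = len(N)

    def objective(w):
        residual = f_a - w @ N
        return residual @ residual

    def gradient(w):
        return -2.0 * N @ (f_a - w @ N)

    result = minimize(
        objective,
        np.full(K, 1.0 / K),                 # start from uniform weights
        jac=gradient,
        method="SLSQP",
        bounds=[(0.0, None)] * K,            # w >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum w = 1
        options={"ftol": 1e-12},
    )
    return result.x
```

Each row of `neighbour_descriptors` is the descriptor $F_{A_k}$ of one of the $K$ nearest neighbors of $L_A$.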

Then the weighted graph based on MC can be constructed by the matrix $W^{MC}$, which is defined as

\[ W^{MC}_{ij} = \begin{cases} w^{MC}_{ij} & \text{if } L_j \in \mathcal{K}^{MC}(L_i), \\ 0 & \text{otherwise}. \end{cases} \tag{10} \]

Note that usually we have $W^{MC}_{ij} \ne W^{MC}_{ji}$.

When $K \ll V$, the computational complexity for solving Eq. (9) is $O(VK^2)$. Considering that the main purpose of introducing the vector $F_k$ $(1 \le k \le V)$ is to make the representation of each 3D object of the same length, a predefined constant $R$ $(R \le V)$ is used to confine the length of the vector $F_k$; then the computational complexity can be reduced to $O(RK^2)$.

To reduce the length of the vector $F_k$, $R$ representative MCs $\lambda_{S_i}$ $(1 \le i \le R)$ are selected first, then the vector $F_k$ is defined as $F_k = [F_{k,S_1}, F_{k,S_2}, \ldots, F_{k,S_R}]^T$. In our proposal, the criterion for MC selection is

\[ J = \max_{S_1, S_2, \ldots, S_R} \left[ \sum_{i=1}^{R} \sum_{j=1}^{R} d^{KLD}(\lambda_{S_i}, \lambda_{S_j}) \right]. \tag{11} \]

That is to say, we want to find $R$ MCs with the greatest total divergence. The reason for this criterion is straightforward. If two MCs, for example $\lambda_A$ and $\lambda_B$, are similar, the difference between the values of $F_{kA}$ and $F_{kB}$ will be small. Therefore, there is no need to choose $\lambda_A$ and $\lambda_B$ at the same time.

Given the distances of all the MC pairs, the most direct way for MC selection is to enumerate all the combinations of $R$ MCs. The computational complexity of this algorithm is $O(V^R)$, which is quite high for a large value of $R$. In our proposal, a greedy algorithm is adopted. Its basic idea is that at each step, we select the MC with the greatest divergence from the MCs which have already been selected. Although only a suboptimal solution is obtained by the greedy algorithm, the computational complexity can be reduced to $O(V^2 + VR)$. As the main aim of MC selection is to confine the length of the vector $F_k$, the suboptimal solution is also acceptable.
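The greedy selection can be sketched as follows, given a precomputed $V \times V$ matrix of pairwise KLD values. The text does not say how the first MC is chosen; the sketch assumes it is the one with the largest total divergence to all others, which costs $O(V^2)$, and then keeps a running sum of divergences to the selected set so that each further step costs $O(V)$. The function name `greedy_select` is hypothetical.

```python
import numpy as np


def greedy_select(divergence, R):
    """Greedily pick R indices with large total pairwise divergence (cf. Eq. (11))."""
    D = np.asarray(divergence, dtype=float)
    D = D + D.T  # Eq. (11) sums over ordered pairs, so both directions count
    # Seed (assumption): the MC with the largest total divergence to all others.
    selected = [int(D.sum(axis=1).argmax())]
    # gain[c] = total divergence between candidate c and the selected set.
    gain = D[selected[0]].copy()
    gain[selected[0]] = -np.inf  # never pick the same MC twice
    while len(selected) < R:
        nxt = int(gain.argmax())
        selected.append(nxt)
        gain += D[nxt]
        gain[nxt] = -np.inf
    return selected
```

For four models whose pairwise divergences behave like distances between the points 0, 1, 2 and 10 on a line, selecting two models returns the two extremes, as expected from the criterion.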

To construct the weighted graph for 3D object retrieval, we need to solve $V$ similar optimization problems altogether. Therefore, the total complexity for graph construction with MC selection is $O(V^2 + VR + VRK^2)$, which is much lower than $O(V^2K^2)$, the complexity without MC selection.

3.2. Graph construction based on many-to-many matching

Suppose each 3D object $L_k$ $(1 \le k \le V)$ in the database is represented by $\{(r_{k1}, v_{k1}), (r_{k2}, v_{k2}), \ldots, (r_{k,N_k}, v_{k,N_k})\}$, then the EMD between any two objects can be calculated. For a given object $L_A$, we find its $K$ nearest neighbors based on $d^{EMD}$, and denote them by $\mathcal{K}^{EMD}(L_A) = \{L_{B_1}, L_{B_2}, \ldots, L_{B_K}\}$. As in [17], we also want to construct the weighted graph by exploring the relationship between $L_A$ and $\mathcal{K}^{EMD}(L_A)$. However, since the objects may be represented by sets of views with different sizes, the method in [17] cannot be used directly. To deal with the problem, our basic idea is to treat all the objects in $\mathcal{K}^{EMD}(L_A)$ as a whole, and to calculate the EMD between $L_A$ and $\mathcal{K}^{EMD}(L_A)$.

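For the object $L_{B_k}$ $(1 \le k \le K)$, let its weight in $\mathcal{K}^{EMD}(L_A)$ be $w^{EMD}_{A,B_k}$. When the objects in $\mathcal{K}^{EMD}(L_A)$ are considered at the same time, the weight of its $j$-th $(1 \le j \le N_{B_k})$ view becomes $(w^{EMD}_{A,B_k} \cdot v_{B_k,j})$. According to the normalization constraint, the following equation should hold:

\[ \sum_{k=1}^{K} \sum_{j=1}^{N_{B_k}} \left( w^{EMD}_{A,B_k} \cdot v_{B_k,j} \right) = 1. \tag{12} \]

As

\[ \sum_{k=1}^{K} \sum_{j=1}^{N_{B_k}} \left( w^{EMD}_{A,B_k} \cdot v_{B_k,j} \right) = \sum_{k=1}^{K} w^{EMD}_{A,B_k} \sum_{j=1}^{N_{B_k}} v_{B_k,j} = \sum_{k=1}^{K} w^{EMD}_{A,B_k}, \tag{13} \]

all the weights $w^{EMD}_{A,B_k}$ should satisfy the constraint

\[ \sum_{k=1}^{K} w^{EMD}_{A,B_k} = 1. \tag{14} \]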

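Let the distance between $r_{Ai}$ and $r_{B_k,j}$ be denoted as $d_k(i,j)$. If the set of weights $\{w^{EMD}_{A,B_1}, w^{EMD}_{A,B_2}, \ldots, w^{EMD}_{A,B_K}\}$ is given beforehand, the EMD between $L_A$ and $\mathcal{K}^{EMD}(L_A)$ can be calculated as

\[ d^{EMD}(L_A, \mathcal{K}^{EMD}(L_A)) = \min_{f_k(i,j)} \left[ \sum_{k=1}^{K} \sum_{i=1}^{N_A} \sum_{j=1}^{N_{B_k}} f_k(i,j)\, d_k(i,j) \right], \tag{15} \]

subject to

\[ f_k(i,j) \ge 0, \quad 1 \le i \le N_A,\ 1 \le j \le N_{B_k},\ 1 \le k \le K, \]

\[ \sum_{k=1}^{K} \sum_{j=1}^{N_{B_k}} f_k(i,j) = v_{Ai}, \quad 1 \le i \le N_A, \]

\[ \sum_{i=1}^{N_A} f_k(i,j) = w^{EMD}_{A,B_k} \cdot v_{B_k,j}, \quad 1 \le j \le N_{B_k},\ 1 \le k \le K. \]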

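In fact, the set of weights is unknown. With the idea of “reconstructing” the object $L_A$ with the objects in $\mathcal{K}^{EMD}(L_A)$, we hope to find the optimal weights, by which the EMD between $L_A$ and $\mathcal{K}^{EMD}(L_A)$ is minimized. Therefore, $w^{EMD}_{A,B_k}$ can be calculated by

\[ Z_A = \min_{w^{EMD}_{A,B_k}} \left[ d^{EMD}(L_A, \mathcal{K}^{EMD}(L_A)) \right] = \min_{f_k(i,j),\, w^{EMD}_{A,B_k}} \left[ \sum_{k=1}^{K} \sum_{i=1}^{N_A} \sum_{j=1}^{N_{B_k}} f_k(i,j)\, d_k(i,j) \right], \tag{16} \]

subject to

\[ f_k(i,j) \ge 0, \quad 1 \le i \le N_A,\ 1 \le j \le N_{B_k},\ 1 \le k \le K, \]

\[ w^{EMD}_{A,B_k} \ge 0, \quad 1 \le k \le K, \]

\[ \sum_{k=1}^{K} \sum_{j=1}^{N_{B_k}} f_k(i,j) = v_{Ai}, \quad 1 \le i \le N_A, \]

\[ \sum_{i=1}^{N_A} f_k(i,j) = w^{EMD}_{A,B_k} \cdot v_{B_k,j}, \quad 1 \le j \le N_{B_k},\ 1 \le k \le K, \]

\[ \sum_{k=1}^{K} w^{EMD}_{A,B_k} = 1. \]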

Comparing Eq. (16) with Eq. (6), the main difference is that the constraints in Eq. (16) contain new variables $w^{EMD}_{A,B_k}$ $(1 \le k \le K)$. However, it is also a linear programming problem, and can be solved efficiently. Generally speaking, to minimize the cost function in Eq. (16), a larger weight is usually assigned to a more similar neighboring 3D object. As an extreme case, if there exists a 3D object $L_{B_i} \in \mathcal{K}^{EMD}(L_A)$ with the same view-based representation as $L_A$, the optimal solution will be $w^{EMD*}_{A,B_i} = 1$, and $w^{EMD*}_{A,B_j} = 0$ $(j \ne i)$.

Then the second weighted graph is constructed by the matrix $W^{EMD}$, which is defined similarly to $W^{MC}$. As in Section 2.2.2, considering the computational load, $W^{EMD}$ can also be calculated based on the results of view clustering.
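The joint linear program of Eq. (16) can be sketched by appending the $K$ weight variables to the flow variables. As in the earlier EMD sketch, the Euclidean ground distance and the uniform view weights $1/N$ are assumptions, `scipy.optimize.linprog` is one possible solver, and the function name `emd_reconstruction_weights` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def emd_reconstruction_weights(feats_a, neighbour_feats):
    """Eq. (16): returns (weights w_k of the K neighbours, minimal cost Z_A)."""
    A = np.asarray(feats_a, dtype=float)
    na = len(A)
    nbrs = [np.asarray(B, dtype=float) for B in neighbour_feats]
    K = len(nbrs)
    # Ground distances d_k(i, j), one (N_A x N_Bk) matrix per neighbour.
    costs = [np.linalg.norm(A[:, None] - B[None], axis=2) for B in nbrs]
    n_flow = sum(c.size for c in costs)
    # Variables: all flows f_k(i, j), neighbour by neighbour, then the K weights.
    c = np.concatenate([x.ravel() for x in costs] + [np.zeros(K)])
    rows, rhs = [], []
    for i in range(na):  # sum_k sum_j f_k(i, j) = v_Ai = 1 / N_A
        row, off = np.zeros(n_flow + K), 0
        for cost in costs:
            nb = cost.shape[1]
            row[off + i * nb:off + (i + 1) * nb] = 1.0
            off += cost.size
        rows.append(row)
        rhs.append(1.0 / na)
    off = 0
    for k, cost in enumerate(costs):  # sum_i f_k(i, j) - w_k * v_{Bk,j} = 0
        nb = cost.shape[1]
        for j in range(nb):
            row = np.zeros(n_flow + K)
            row[off + j:off + cost.size:nb] = 1.0
            row[n_flow + k] = -1.0 / nb
            rows.append(row)
            rhs.append(0.0)
        off += cost.size
    row = np.zeros(n_flow + K)  # sum_k w_k = 1
    row[n_flow:] = 1.0
    rows.append(row)
    rhs.append(1.0)
    res = linprog(c, A_eq=np.array(rows), b_eq=np.array(rhs), bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[n_flow:], float(res.fun)
```

When one neighbor has exactly the same views as the query object, the solver assigns it a weight of one and the minimal cost is zero, which matches the extreme case described above.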

3.3. Combined graph construction

The final weighted graph is obtained by combining $W^{MC}$ and $W^{EMD}$, namely,

\[ W^{Com} = \beta W^{MC} + (1 - \beta) W^{EMD}, \tag{17} \]

where $\beta$ $(0 \le \beta \le 1)$ has the same function as $\alpha$ in Eq. (7).

Besides the linear combination coefficient $\beta$, if the vector $F_k$ $(1 \le k \le V)$ is defined without MC selection, the only parameter in our algorithm for graph construction is $K$, the number of nearest neighbors. As shown in Section 4.2, when its value varies over a large range, it has little effect on the final results. If MC selection is involved, the additional parameter $R$, namely the number of selected MCs, also hardly influences the retrieval performance, which is also demonstrated in the experiments.

4. Experimental results

The proposed approaches for view-based 3D object retrieval are evaluated on two kinds of databases: one is virtual and constructed by taking views of 3D models in the computer, and the other is composed of sets of views representing real objects in the world.

To obtain the view-based representation of 3D models, we use two subsets of the “NTU 3D Model Database” [5] to construct virtual databases, and denote them as “DB1” and “DB2”, respectively. DB1 consists of 100 objects, which are classified into 10 categories of 10 objects each; DB2 includes 250 objects, which belong to 25 categories of 10 objects each.

Fig. 1. Three camera settings for taking views of a 3D model from different positions. The pentagram in the figure denotes the object, and the dots denote the virtual cameras. (a) Circle72, (b) Dodeca20 and (c) C60.

We use virtual cameras in the 3D processing software to take views of models from different positions. In our experiments, the virtual cameras are set in three different ways: (a) 72 cameras are set evenly on a circle over the object; (b) 20 cameras are set on the vertices of a regular dodecahedron; (c) 60 cameras are set on the vertices of a polyhedron, which has the same structure as Buckminsterfullerene (C60). The three camera settings are denoted as “Circle72”, “Dodeca20” and “C60”, and they are illustrated in Fig. 1. Some example views taken by the virtual cameras are shown in Fig. 2.

If the views of all the objects in a virtual database are taken with the same camera setting, the database is denoted as “DB1/DB2-camera setting” for convenience, such as DB1-Circle72, DB2-C60 and so on. To make the task of retrieval more challenging, we also construct databases with mixed view-based representation, and denote them as “DB1/DB2-Mixture” in the rest of the paper. In these databases, five objects of each category are represented in the form of “Dodeca20”, and the other five in the form of “C60”.

It should be noted that although the virtual databases are constructed from 3D models, only the information from the views is utilized when the retrieval task is conducted.

For evaluating the performance of our proposal on a real-world database, the ETH-80 database [18] is adopted in our experiments. It contains 80 objects from eight categories. Each object is represented by 41 views spaced evenly over the upper viewing hemisphere, and the viewing positions are obtained by subdividing the faces of an octahedron to the third recursion level. More details about the database can be found in [18].

The feature adopted in all the experiments is the 18-dimensional edge direction histogram of each view. Based on experiments similar to those in [9], the length of the quasi-histogram, namely $K_H$, is set to 12. The criteria used for performance evaluation include: (1) precision and recall, which denote the percentages of the relevant retrieved objects in the return set and the relevance set; (2) correct rate of nearest neighbor (CRNN), the percentage of the closest matches that belong to the same class as the query; (3) correct rate of first-tier (CRFT), the percentage of models in the query’s class that appear within the top 10 matches. More details of these criteria can be found in [14]. Each 3D object in the database is used as the query, and the average results are shown below.

Fig. 2. Some example views taken with the camera setting “Circle72”.

Fig. 3. The retrieval performance of our proposal with different values of $\alpha$ on DB1-Circle72 (horizontal axis: $\alpha$; vertical axis: precision; curves: P5 and P10).

Fig. 4. Performance comparison by the nearest neighbor algorithm on DB1-Circle72 (precision versus recall; curves: our proposal, Hausdorff, KLD and one-to-one).

Fig. 5. Performance comparison by the nearest neighbor algorithm on DB2-Mixture (precision versus recall; curves: our proposal, Hausdorff and KLD).
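The three criteria can be sketched as simple functions of a ranked list of category labels. The sketch assumes that each ranking excludes the query itself and that the number of relevant objects, `n_relevant`, is supplied by the caller; these conventions and the function names are assumptions, and [14] should be consulted for the exact definitions.

```python
def precision_recall(ranked_labels, query_label, k, n_relevant):
    """Precision and recall of the first k returned objects."""
    hits = sum(label == query_label for label in ranked_labels[:k])
    return hits / k, hits / n_relevant


def crnn(rankings, query_labels):
    """Fraction of queries whose closest match belongs to the query's class."""
    correct = sum(r[0] == q for r, q in zip(rankings, query_labels))
    return correct / len(query_labels)


def crft(rankings, query_labels, n_relevant, tier=10):
    """Average fraction of the query's class found within the top `tier` matches."""
    rates = [sum(label == q for label in r[:tier]) / n_relevant
             for r, q in zip(rankings, query_labels)]
    return sum(rates) / len(rates)
```

Here `rankings[i]` is the list of category labels of the objects returned for the i-th query, ordered by increasing distance.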

4.1. 3D object retrieval by nearest neighbor

In this section, the retrieval experiments by the nearest neighbor algorithm are conducted. First, we discuss the influence of the parameter $\alpha$ in our proposed combined distance. When it varies from 0 to 1, the retrieval performance of our proposal on DB1-Circle72 is shown in Fig. 3, where $P_k$ $(k = 5, 10)$ denotes the precision of the first $k$ returned 3D objects. Note that $\alpha = 0$ or 1 corresponds to the case when only the information based on EMD or MC is used, respectively. As higher precision can be achieved with $0 < \alpha < 1$, it is necessary to combine $d^{MC}$ and $d^{EMD}$. The performance is the best when $\alpha = 0.8$, which indicates that the exploited information based on MC is more important for this database.

Similar results can be obtained when the experiments are conducted on other databases. As the results based on statistical modeling and many-to-many matching are influenced by the number of views and the view point setting, the value of $\alpha$ corresponding to the best retrieval performance is usually different for different databases.

We next compare the retrieval performances of different distance measures between two view-based 3D objects. The methods used for comparison include the Hausdorff distance, the KLD between two Gaussian distributions [19], the distance based on one-to-one matching [5], and our proposed combined distance.

The precision-recall curves of 3D object retrievalwith different distance measures on DB1-Circle72, DB2-Mixture and ETH-80 are shown in Figs. 4–6, respectively.It should be noted that as the numbers of views torepresent different objects are not always the same forDB2-Mixture, the distance based on one-to-one matchingcannot be used on this database. By comparing the curvescorresponding to the same distance in different figures,we can conclude that the retrieval performance on ETH-80 is the best, while the performance on DB2-Mixture isthe worst. The reason is that the visual features of 3Dobjects from the same category are more similar for ETH-80, and the mixed view-based representation makes theretrieval task more challenging on DB2-Mixture. Bycomparing the curves in each figure, we can see that ourproposed distance outperforms almost all the other


Fig. 6. Precision-recall curves of 3D object retrieval with different distance measures on ETH-80 (x-axis: Recall; y-axis: Precision; curves: Our proposal, Hausdorff, KLD and One-to-one).

Table 1. The CRNN and CRFT of 3D object retrieval with different distance measures.

Database       Distance       CRNN    CRFT
DB1-Circle72   Our proposal   0.810   0.601
DB1-Circle72   Hausdorff      0.690   0.514
DB1-Circle72   KLD            0.700   0.523
DB1-Circle72   One-to-one     0.730   0.507
DB2-Mixture    Our proposal   0.664   0.376
DB2-Mixture    Hausdorff      0.500   0.304
DB2-Mixture    KLD            0.496   0.297
ETH-80         Our proposal   0.888   0.724
ETH-80         Hausdorff      0.850   0.676
ETH-80         KLD            0.888   0.668
ETH-80         One-to-one     0.925   0.718

Table 2. The P10 based on the weighted graph W_MC with different values of R.

R              250     200     150     100     50      10
DB2-Dodeca20   0.389   0.392   0.390   0.390   0.379   0.377
DB2-C60        0.452   0.448   0.447   0.452   0.456   0.433
DB2-Mixture    0.318   0.317   0.311   0.312   0.305   0.291


methods on the three databases, for it can effectively utilize the information from both statistical modeling and many-to-many matching. The performances of both the Hausdorff distance and the KLD are poor, as they cannot exploit useful information from the sets of views. The performance of the distance based on one-to-one matching differs greatly between DB1-Circle72 and ETH-80. The reason is as follows: when the views of each 3D object are taken, both the camera setting and the orientation of the frontal view are fixed for every object in ETH-80, while the objects in DB1-Circle72 may face in any direction. Therefore, the corresponding view pairs of two objects can be found in ETH-80, but usually cannot in DB1-Circle72.

The CRNN and CRFT of 3D object retrieval with different distance measures on the three databases are shown in Table 1. From the experimental results, we can draw the same conclusions as those based on the precision-recall curves.

Although the distance based on one-to-one matching may achieve good retrieval performance in some cases, we cannot always find the corresponding view pairs in real applications. It is even infeasible to use this distance when the number of views representing each 3D object is different. In contrast, our proposed distance places no constraint on the camera setting. Therefore, our proposal can be applied more widely.

4.2. 3D object retrieval by graph-based semi-supervised learning

In this section, we use graph-based semi-supervised learning for 3D object retrieval, and compare different methods for constructing the weighted graph. In all the experiments, the manifold ranking algorithm [20,21] is adopted for spreading information. As similar results can be obtained on different databases, only the retrieval performances on DB2-Dodeca20, DB2-C60 and DB2-Mixture are illustrated here.
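For readers unfamiliar with manifold ranking, the following is a minimal sketch of its commonly used closed form, in which ranking scores are obtained as (I − c·S)^(-1) y with S the symmetrically normalized weight matrix and y an indicator of the query. The damping value `0.99`, the name `damping` (chosen to avoid a clash with the parameter α of the combined distance) and the function name are assumptions for illustration, not settings reported here.

```python
import numpy as np

def manifold_ranking(weights, query_index, damping=0.99):
    """Rank all graph nodes with respect to one query node.

    weights: symmetric non-negative (n, n) edge-weight matrix.
    Returns one score per node; larger means more relevant.
    """
    w = np.asarray(weights, dtype=float).copy()
    np.fill_diagonal(w, 0.0)                      # no self-loops
    degree = w.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    s = inv_sqrt[:, None] * w * inv_sqrt[None, :]
    y = np.zeros(len(w))
    y[query_index] = 1.0                          # query indicator vector
    # Closed-form limit of the iteration f <- damping * S f + y.
    return np.linalg.solve(np.eye(len(w)) - damping * s, y)
```

Sorting the remaining objects by decreasing score gives the retrieval list. Objects that are not connected to the query through the graph receive a score of zero, which is one reason the way the graph is constructed matters.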

First, we discuss the relationship between the number of selected MCs for defining the vector F_k (1 ≤ k ≤ V) and the retrieval performance. Let the number of nearest neighbors for graph construction be 7. The retrieval precision based on the weighted graph W_MC with different values of R is shown in Table 2. From the table, we can conclude that the influence of the parameter R on the retrieval performance is not very significant. The reason is that the weighted graph is constructed mainly by exploring the relationship between each 3D object and its neighbors, rather than the representation of each object. As the databases we use are not very large, in the following experiments the length of the vector F_k is set to the total number of objects in the database.

We next discuss the influence of the parameter β in the combined weighted graph. When the number of nearest neighbors K is set to 7 and β varies from 0 to 1, the retrieval performances of the combined weighted graph on different databases are shown in Fig. 7. As in the analysis of the parameter α in our proposed distance, β = 0 or β = 1 corresponds to the case in which only the graph W_EMD or W_MC, respectively, is used. It can be seen that the P10 is higher when 0 < β < 1. Therefore, it is reasonable to consider both statistical modeling and many-to-many matching at the same time for graph construction. We can also notice that the most suitable β differs across databases, which is similar to the case of the parameter α. When the experiments are conducted with other values of K, the same conclusion can be drawn.

Then the performance of our proposal for determining the edge weights is evaluated, and the Gaussian function-based method is used for comparison. In the Gaussian function, our proposed combined distance is adopted as the distance measure d(·, ·), in which the parameter α is set to the optimal value according to the experiments. The retrieval results with different values of σ on the three databases are shown in Fig. 8, from which we can see that the parameter influences the retrieval precision significantly. Satisfactory performance can be obtained only when it lies around 0.03.
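The sensitivity to the Gaussian width can be seen from a sketch of the baseline graph construction. The code assumes the common form w_ij = exp(−d(i, j)² / (2σ²)) restricted to the K nearest neighbors of each object and then symmetrized; the exact normalization of the exponent, the neighbor restriction and the function name are assumptions rather than details taken from the experiments.

```python
import numpy as np

def gaussian_knn_graph(dist, k, sigma):
    """Gaussian edge weights on a K-nearest-neighbor graph (assumed form).

    dist: (n, n) matrix of pairwise distances d(i, j).
    Returns a symmetric (n, n) weight matrix with a zero diagonal.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    weights = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist[i], kind="stable")
        neighbors = order[order != i][:k]         # K nearest, excluding i
        weights[i, neighbors] = np.exp(
            -dist[i, neighbors] ** 2 / (2.0 * sigma ** 2)
        )
    # Keep an edge if either endpoint selected the other as a neighbor.
    return np.maximum(weights, weights.T)
```

Since the distance is divided by σ inside the exponent, a σ that is much smaller than the typical neighbor distance drives all weights toward zero, while a much larger σ makes them nearly equal, so only a narrow range of σ separates near and far neighbors well.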

Fig. 8. The retrieval results with Gaussian function-based graph construction (x-axis: σ; y-axis: P10; curves: DB2-Dodeca20, DB2-C60 and DB2-Mixture).

Fig. 9. The retrieval results with our proposal for graph construction (x-axis: K; y-axis: P10).

Fig. 7. The retrieval performance of the combined weighted graph with different values of β (x-axis: β; y-axis: P10).

In the experiments with our proposal for graph construction, the parameter β is also set to the optimal value. Fig. 9 illustrates the retrieval precision with different numbers of nearest neighbors on the three databases. Compared with the Gaussian function-based method in Fig. 8, our method is more robust, and the retrieval precision is quite stable with respect to the variation of the parameter K. Moreover, its retrieval performance is comparable with the best result of the Gaussian function-based method, which demonstrates the effectiveness of our proposal.

5. Conclusions and future work

In this paper, we mainly discuss methods for view-based 3D object retrieval. Based on the idea of information mining from both statistical modeling and many-to-many matching, a combined distance measure between two 3D objects is defined, and a robust way to determine the edge weights in graph-based semi-supervised learning is proposed. Both of our proposed approaches consist of two parts: one corresponds to MC, and the other to EMD. Better performance can be achieved by combining the results of the two parts. Experiments on different databases demonstrate that our proposal outperforms the existing methods.

For future work, we will extend our proposal in the following two aspects: (1) defining a more effective distance measure between two view-based 3D objects according to subspace analysis and manifold learning; (2) conducting more extensive experiments on databases with more 3D objects.

Acknowledgments

This work is supported by the National Basic Research Program of China (973 Program, no. 2010CB731800) and the National Natural Science Foundation of China (nos. 60772048 and 60721003).

References

[1] B. Bustos, D.A. Keim, D. Saupe, T. Schreck, D.V. Vranic, Feature-based similarity search in 3D object databases, ACM Computing Surveys 37 (4) (2005) 345–387.

[2] J.W.H. Tangelder, R.C. Veltkamp, A survey of content based 3D shape retrieval methods, Multimedia Tools and Applications 39 (2008) 441–471.

[3] A.D. Bimbo, P. Pala, Content-based retrieval of 3D models, ACM Transactions on Multimedia Computing, Communications and Applications 2 (1) (2006) 20–43.

[4] P. Shilane, P. Min, M. Kazhdan, T. Funkhouser, The Princeton shape benchmark, in: Proceedings of the Shape Modeling Applications, 2004, pp. 167–178.

[5] D.Y. Chen, X.P. Tian, Y.T. Shen, M. Ouhyoung, On visual similarity based 3D model retrieval, Computer Graphics Forum (EUROGRAPHICS) 22 (3) (2003) 223–232.

[6] Y.T. Shen, D.Y. Chen, X.P. Tian, M. Ouhyoung, 3D model search engine based on lightfield descriptors, in: Proceedings of the EUROGRAPHICS, 2003.

[7] F. Wang, F. Li, Q. Dai, G. Er, View-based 3D object retrieval and recognition using tangent subspace analysis, in: Proceedings of the SPIE Visual Communications and Image Processing, 68220I, 2008.

[8] R. Wang, S. Shan, X. Chen, W. Gao, Manifold–manifold distance with application to face recognition based on image set, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008.

[9] F. Li, Q. Dai, W. Xu, G. Er, Histogram mining based on Markov chain and its application to image categorization, Signal Processing: Image Communication 22 (9) (2007) 785–796.

[10] Y. Rubner, C. Tomasi, L.J. Guibas, The earth mover’s distance as a metric for image retrieval, International Journal of Computer Vision 40 (2) (2000) 99–121.

[11] J. Kapur, H. Kesava, Entropy Optimization Principles with Applications, Academic Press, New York, 1992.

[12] F. Jing, M. Li, H.J. Zhang, B. Zhang, Relevance feedback in region-based image retrieval, IEEE Transactions on Circuits and Systems for Video Technology 14 (5) (2004) 672–681.

[13] Y. Liu, D. Zhang, G. Lu, W.Y. Ma, A survey of content-based image retrieval with high-level semantics, Pattern Recognition 40 (2007) 262–282.

[14] T.F. Ansary, M. Daoudi, J.P. Vandeborre, A Bayesian 3-D search engine using adaptive views clustering, IEEE Transactions on Multimedia 9 (1) (2007) 78–88.

[15] J. He, M. Li, H.J. Zhang, H. Tong, C. Zhang, Manifold-ranking based image retrieval, in: Proceedings of the ACM International Conference on Multimedia, 2004, pp. 9–16.

[16] J. He, M. Li, H.J. Zhang, H. Tong, C. Zhang, Generalized manifold-ranking-based image retrieval, IEEE Transactions on Image Processing 15 (10) (2006) 3170–3177.

[17] F. Wang, C. Zhang, Label propagation through linear neighborhoods, IEEE Transactions on Knowledge and Data Engineering 20 (1) (2008) 55–67.

[18] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition 2 (2003) 409–415.

[19] G. Shakhnarovich, J.W. Fisher, T. Darrell, Face recognition from long-term observations, in: Proceedings of the European Conference on Computer Vision, 2002, pp. 851–865.

[20] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, Advances in Neural Information Processing Systems (2003) 321–328.

[21] D. Zhou, J. Weston, A. Gretton, O. Bousquet, B. Schölkopf, Ranking on data manifolds, Advances in Neural Information Processing Systems (2003) 169–176.
