
Connected Segmentation Tree -- A Joint Representation of Region Layout and Hierarchy

Narendra Ahuja and Sinisa Todorovic
{n-ahuja, sintod}@uiuc.edu

CVPR 2008

MOTIVATION

Objects in the 3D world:
- are comprised of other objects = parts
- carry different materials on their surfaces, with structure within those surface materials
- have distinct locations and finite volume
- are cohesive

Their image projections:
- regions of contiguous pixels = 2D objects
- photometric properties: color, texture
- spatial layout and hierarchy of 2D objects: layout and embedding of subregions within regions
- geometric properties: shape, area

PROBLEM STATEMENT

GIVEN images, each containing >= 0 occurrences of an object category

IDENTIFY all regions occupied by the category in the image set

LEARN a model of the category that JOINTLY captures:

- Geometric properties (e.g., shape, area, relative displacements)

- Photometric properties (e.g., intensity contrast along the boundary)

- Embedding (or containment) relationships

- Neighbor relationships and their strength

of the regions identified to represent category occurrences.

GIVEN a new image, use the learned category model to

DETECT, RECOGNIZE, SEGMENT any occurrences of the category.

PRIOR WORK vs. PROPOSED OBJECT REPRESENTATION

Modeling properties of regions occupied by 2D objects:

                              photometric   geometric   spatial layout   embedding
bag of keypoints                  ✔             ✘             ✘              ✘
planar-graph models               ✔             ✔             ✔              ✘
hierarchical models               ✔             ✔             ✘              ✔
Connected Segmentation Tree       ✔             ✔             ✔              ✔

- Hierarchical models, e.g., the segmentation tree [1, 2]: yield the same segmentation tree for different layouts
- Planar-graph models: preselected, same-scale parts
- Bag of keypoints: no spatial information
- Proposed model: Connected Segmentation Tree = a hierarchy of adjacency graphs, yielding different connected segmentation trees for different layouts

OVERVIEW OF OUR APPROACH

1) Images = Connected segmentation trees (CST) ⇒ Similar objects = Similar subgraphs

2) Similarity defined in terms of region properties:

- Geometric and photometric

- Strengths of region neighbor relationships, and recursively the same properties of embedded subregions

3) Similar subgraphs are found using max-clique based graph matching

4) Maximally matching subgraphs are fused into a graph-union = the CST category model

5) Matching the model with the CST of a new image:

- Simultaneous recognition and segmentation of all category occurrences

- Explanation of recognition in terms of object parts and their neighbor relationships
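The overview above is essentially a learn-then-match pipeline. The following minimal Python sketch makes the data flow explicit; `to_cst`, `match`, and `fuse` are hypothetical stand-ins for the poster's segmentation, max-clique matching, and graph-union steps, not the authors' code.

```python
def learn_cst_model(images, to_cst, match, fuse):
    """Sketch of steps 1)-4): build CSTs, match them, fuse into a model."""
    csts = [to_cst(image) for image in images]   # 1) image -> CST
    model = csts[0]
    for cst in csts[1:]:
        common = match(model, cst)               # 3) max-clique graph matching
        model = fuse(model, cst, common)         # 4) graph-union = category model
    return model

def recognize(model, image, to_cst, match):
    """Sketch of step 5): matching the model against a new image's CST
    simultaneously recognizes and segments category occurrences."""
    return match(model, to_cst(image))
```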

CONTRIBUTIONS -- GENERALIZED VORONOI DIAGRAM

[Figure: boundary pixels and their standard Voronoi polygons for points; a region's generalized Voronoi polygon is the union of the Voronoi polygons of its boundary pixels.]

- Regions are neighbors if their generalized Voronoi polygons touch

- Strength of neighborliness = Percentage of a Voronoi polygon's perimeter that is shared
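As a concrete illustration of the two bullets above, here is a minimal Python sketch (an assumed helper built on NumPy/SciPy, not the authors' implementation) that computes a generalized Voronoi partition of a label image via a distance transform and derives neighbor strengths from the shared Voronoi-polygon perimeter:

```python
import numpy as np
from scipy import ndimage

def voronoi_neighborliness(labels):
    """labels: 2D int array, 0 = unassigned pixels, k > 0 = region k.
    Returns {(p, q): strength of p's neighborliness toward q}."""
    # Generalized Voronoi partition: every pixel gets the label of the
    # nearest region pixel (Euclidean distance transform of the background).
    _, inds = ndimage.distance_transform_edt(labels == 0, return_indices=True)
    cells = labels[inds[0], inds[1]]
    # Scan horizontal and vertical pixel pairs that straddle a cell border.
    shared, perimeter = {}, {}
    for a, b in ((cells[:, :-1], cells[:, 1:]), (cells[:-1, :], cells[1:, :])):
        diff = a != b
        for p, q in zip(a[diff], b[diff]):
            p, q = int(p), int(q)
            shared[p, q] = shared.get((p, q), 0) + 1
            shared[q, p] = shared.get((q, p), 0) + 1
            perimeter[p] = perimeter.get(p, 0) + 1
            perimeter[q] = perimeter.get(q, 0) + 1
    # Strength of p's neighborliness toward q = fraction of p's Voronoi
    # polygon perimeter shared with q; asymmetric, since p's and q's
    # perimeters differ. (Image-border edges are ignored in this sketch.)
    return {(p, q): n / perimeter[p] for (p, q), n in shared.items()}
```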

CONTRIBUTIONS -- DEFINITION OF REGION NEIGHBORLINESS

1) Regions are exposed to one another through their nearby boundary segments

2) Two regions are called neighbors if parts of their boundaries are:
- visible to each other
- nearby
- sufficiently far from the boundaries of other regions

3) Relative degrees of a region's boundary exposure to its neighbors = strengths of the region's neighborliness

4) Region neighborliness is asymmetric: for example, a small region abutting a large one may expose most of its boundary to that neighbor, while the large region exposes only a small fraction of its boundary in return


CONTRIBUTIONS -- GENERALIZED MAX-CLIQUE GRAPH MATCHING

[Figure: example of matching two graphs whose nodes and edges are both weighted.]

Problem: How do we match graphs in which both the nodes and the edges are weighted?

Properties of the proposed algorithm:

- It matches regions with regions and, separately, region spatial relationships with the corresponding relationships
- The maximum common subgraph found by the algorithm preserves the original structure of the input graphs
- It seeks legal region-to-region and relationship-to-relationship matches whose unary and pairwise potentials are large
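A sketch of the generalized association graph that the theorem below relies on (the names, and the choice of plain Python dicts, are assumptions for illustration): each node is a candidate match (v, v') weighted by its unary potential, and each edge joins two structurally compatible matches and is weighted by their pairwise potential, so node and edge weights both contribute to a clique's score.

```python
def build_association_graph(V1, V2, psi, phi, compatible):
    """Nodes: candidate matches (v, w) with node weight psi(v, w).
    Edges: pairs of matches that can coexist in a structure-preserving
    bijection, with edge weight phi(a, b)."""
    nodes = {(v, w): psi(v, w) for v in V1 for w in V2}
    edges = {}
    pairs = list(nodes)
    for i, a in enumerate(pairs):
        for b in pairs[i + 1:]:
            # One-to-one: two matches may not reuse a region. 'compatible'
            # must also check the containment and neighbor structure.
            if a[0] != b[0] and a[1] != b[1] and compatible(a, b):
                edges[frozenset((a, b))] = phi(a, b)
    return nodes, edges
```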

Theorem: Maximum subgraph isomorphism between two graphs with weighted edges and nodes $\Longleftrightarrow$ maximum weight clique of their generalized association graph.

Given two CST image representations $G = (V, E)$ and $G' = (V', E')$, find a structure-preserving bijection $f = \{(v, v')\} \subseteq V \times V'$ which maximizes their similarity measure, defined as

$$S_{GG'} = \max_f \Big[ \sum_{(v,v') \in f} \psi_{vv'} + \sum_{(v,v',u,u') \in f \times f} \phi_{vv'uu'} \Big]$$

where the unary potential $\psi_{vv'}$ is a function of intrinsic region properties, and the pairwise potential $\phi_{vv'uu'}$ is a function of region neighborliness and embedding.
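To make the theorem concrete, here is a toy greedy search over the association graph sketched above; it is only a heuristic stand-in for the exact max-weight-clique algorithm the poster proposes, but it shows how the clique it returns is the bijection $f$ and how its accumulated node and edge weights realize $S_{GG'}$:

```python
def greedy_max_weight_clique(nodes, edges):
    """Greedily grow a clique of the association graph, visiting candidate
    matches in order of unary weight. The clique is the match f; 'score'
    accumulates the unary and pairwise potentials, i.e., S_GG' above."""
    clique, score = [], 0.0
    for a in sorted(nodes, key=nodes.get, reverse=True):
        # a may join only if it is adjacent to every match already chosen
        if all(frozenset((a, b)) in edges for b in clique):
            score += nodes[a] + sum(edges[frozenset((a, b))] for b in clique)
            clique.append(a)
    return clique, score
```

An exact search would replace the greedy loop with branch-and-bound or the replicator dynamics commonly used for association graphs [15], but the scoring is the same.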

EXAMPLE: CST MODEL OF WEIZMANN HORSES


[Figure 5: samples from the Hoofed Animals (top) and LabelMe (bottom) datasets; (a) original image, (b) STs, (c) CSTs. CSTs outperform STs in detection and segmentation; undetected image parts are masked out.]

Qualitative evaluation -- Model: Fig. 6 illustrates the model G of the category "horses", learned on six randomly selected training images D from the Weizmann dataset. Nodes v in G are depicted as rectangles containing the regions in D that got matched with v during learning. As can be seen, the structure of G correctly captures the recursive containment and neighbor relations of the regions that comprise the horses' appearances in the training set. For example, nodes "head," "neck," and "mane" are found to be children of the node representing a larger "head&neck" region, and they are all identified as neighbors. Also, G correctly encodes that "head&neck" and "tail" are not neighbors, and that they are positioned to the left and right of "abdomen." Frequently occurring, similar regions representing background may be included in the model if they consistently co-occur in the training set with regions representing the category of interest. Thus, G contains a few nodes corresponding to "fence", identified as a neighbor of the horse's "head." Typically, these background nodes make up a small percentage (3-5%) of the nodes in the model, depending on the specific training set.

[Figure 6: CST-based model of Weizmann horses.]

Quantitative evaluation: There is very limited past work on segmenting category instances. For Weizmann horses, the best reported segmentation results are SE = 7% [22] and SE = 18.2% [5]; [22] is semi-supervised, requires training images to contain only horses, and cannot handle occlusion. [16] models multiple segmentations with LDA and reports SE = 47% for buildings and SE = 79% for cars. Fig. 7 and Tables 1-2 show the detection, segmentation, and recognition errors of CST. A small increase in the number of negative examples Mn does not degrade performance. As Mn becomes larger, objects belonging to other categories start appearing more frequently in our training set; by our basic definition, these objects then become the target category, and the algorithm correctly learns the new category instances, as expected. Thus, with the increase of Mn, the training set becomes inappropriate. Increasing the number of positive training examples yields higher recognition recall and precision.

[Figure: recall-precision curves on Caltech-101 Faces and Motorbikes: (H) hierarchical relationships only, (H+N) hierarchical and neighborhood relationships, (H+WN) hierarchical and weighted neighborhood relationships; (10%) and (20%) give the size of a rectangular occlusion with respect to image size.]

Table 2. Increase in the area under the RPC.

Algorithm          CST vs. ST   CST vs. SL   CST          CST
# positive imgs    10           10           20 vs. 10    30 vs. 10
Faces              8.5%         3.4%         1.7%         2.5%
Horses             9.5%         4.1%         2.8%         2.9%

[Figure: training set with positive and negative examples; sample parts and their spatial relations in the learned CST model, with links denoting containment and neighborliness.]

REFERENCES

[1] N. Ahuja and S. Todorovic, "Learning the taxonomy and models of categories present in arbitrary images," in ICCV, 2007.
[2] S. Todorovic and N. Ahuja, "Unsupervised category modeling, recognition and segmentation in images," IEEE TPAMI, 2008.

ACKNOWLEDGMENT

The support of the National Science Foundation under grant NSF IIS 07-43014 is gratefully acknowledged.

RESULTS

UIUC Hoofed Animals Dataset [1] (http://vision.ai.uiuc.edu/~sintod/datasets.html)

[Figure: simultaneous recognition and segmentation of the category "cow", represented by the CST model (middle row, CST) and by the segmentation-tree (ST) model [1, 2] (bottom row, ST). CSTs outperform STs.]

In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, June 2008

            LabelMe   LabelMe     LabelMe   Weizmann
            Trees     Buildings   Cars      Horses     Horses     Cows      Deer      Sheep     Goats     Camels
Recall      47.6±6.9  92.6±6.9    67.6±6.9  91.9±5.2   81.2±10.3  78.4±4.2  88.1±6.9  81.2±5.3  78.2±8.6  89.9±7.2
Seg. error  41.6±7.9  34.6±13.4   32.5±8.2  7.2±2.5    15.9±5.3   17.1±4.6  11.1±8.4  24.8±7.2  20.1±8.1  11.5±5.1
Rec. error  19.7±3.8  11.6±2.9    12.9±4.8  7.9±4.1    7.8±4.2    6.5±6.2   7.7±3.4   7.8±4.1   12.2±5.4  3.2±3.9

Table 1. Detection recall, segmentation error, and recognition error (in %), using the same number of training and test images as in [16, 22, 5]. The last six categories are from the UIUC Hoofed Animals dataset.


Characteristics of recognition and segmentation:

1) Invariant to translation and in-plane rotation

2) Robust against a certain degree of:
- scale variation
- illumination change
- object articulation
- partial occlusion
- background clutter

3) Segmentation remains good when object boundaries are jagged or blurred, or form complex topologies


[Figure 7: (left) detection recall-precision curves on Caltech-101 (Faces+Planes+Mbikes+Cars) for CST, CST-unweight, ST, and for CST and ST under 20% occlusion; "CST-unweight" means that edges in the CST are not weighted, 20% is the size of a rectangular occlusion w.r.t. the image size, Mp = 10, Mn = 10, and ST is the method of [20]. (right) Recognition accuracy of CST and ST on four Caltech-101 categories for a varying ratio of Mp and Mn in the training set.]

Quantitative evaluation: Fig. 7 (left) presents the detection recall-precision curves (RPC) for the Caltech-101 categories using CSTs and STs. Detection performance in the presence of occlusion is tested by masking out a randomly selected rectangular area in the image and replacing it with a patch from the background category of Caltech-101. CST increases the area under the RPC of ST by 6.5%, and by 3.1% in the presence of an occluding patch covering 20% of the image. Invariance to in-plane rotation is tested by randomly rotating test images; performance on the rotated images is the same as that in Fig. 7. Measuring the strength of neighborliness with the generalized Voronoi diagram improves performance over setting the weights of links in the CST to only 1 or 0 (referred to as CST-unweight): CST increases the area under the RPC of CST-unweight by 2.2%. Fig. 7 (right) shows the recognition accuracy of CST and ST. A small increase in Mn does not degrade the accuracy. As Mn becomes larger, objects belonging to other categories start appearing more frequently, and thus get learned, making the training set inappropriate. Increasing Mp yields smaller recognition error. CST outperforms ST in recognition, and maintains high accuracy longer as Mn increases. Table 1 summarizes the detection recall, and the segmentation and recognition errors, obtained at the equal error rate on the LabelMe and Hoofed Animals datasets. For Hoofed Animals, CST outperforms ST in detection recall by 7.5%, in segmentation by 10.7%, and in recognition by 8.6%. For comparison, we obtain SE = 6.5% on the relatively simple UIUC (multiscale) car dataset, using the same set-up as [10], whose result is SE = 7.9%. The other hierarchical approaches cited here use non-benchmark datasets, or report a single retrieval result for the entire Caltech-101, which is beyond the focus of this paper. Non-hierarchical approaches to modeling object categories, which ignore the containment of object parts and use image segments obtained at a single pre-selected segmentation scale, report the following results: [16] -- SE = 47% for LabelMe buildings and SE = 79% for LabelMe cars; [22] -- SE = 7% for Weizmann horses; and [5] -- SE = 18.2% for Weizmann horses and a recognition accuracy of 94.6% for the four categories of Caltech-101. CSTs yield better or, in only a few cases, very similar performance.

The presented results demonstrate that our approach is invariant with respect to: (i) translation, in-plane rotation, and object articulation, since the CST itself is invariant to these changes; (ii) a certain degree of scale change, since matching is based on relative properties of regions; (iii) occlusion in the training and test sets, since the graph-union registers the entire (unoccluded) category structure from partial views of occurrences in the training set, while subgraphs of visible object parts in the CST of a test image can still be matched with the model; (iv) minor depth rotations of objects causing shape deformations, because structural instability of CSTs (e.g., due to region splits/mergers) is accounted for during matching; and (v) clutter, since clutter regions are not frequent and thus not learned.

6. Conclusion

We have presented what we believe is the first attempt to jointly learn a canonical model of an object in terms of the photometric and geometric properties, and the containment and neighbor relationships, of its parts. As other fundamental contributions, the paper proposes: (1) a generalized Voronoi diagram to represent region adjacency; and (2) a new max-clique based algorithm for matching weighted graphs, and thus learning the model.

References

[1] N. Ahuja. Dot pattern processing using Voronoi neighborhoods. IEEE TPAMI, 4(3):336-343, 1982.
[2] N. Ahuja. A transform for multiscale image segmentation by integrated edge and region detection. IEEE TPAMI, 18(12):1211-1235, 1996.
[3] N. Ahuja and S. Todorovic. Extracting texels in 2.1D natural textures. In ICCV, 2007.
[4] G. Bouchard and B. Triggs. Hierarchical part-based visual object categorization. In CVPR, vol. 1, pp. 710-715, 2005.
[5] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification. In ICCV, 2007.
[6] G. Carneiro and D. Lowe. Sparse flexible models of local features. In ECCV, vol. 3, pp. 29-43, 2006.
[7] H. Chen, Z. J. Xu, Z. Q. Liu, and S. C. Zhu. Composite templates for cloth modeling and sketching. In CVPR, vol. 1, pp. 943-950, 2006.
[8] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In CVPR, vol. 1, pp. 10-17, 2005.
[9] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, vol. 2, pp. 264-271, 2003.
[10] S. Fidler and A. Leonardis. Towards scalable representations of object categories: Learning a hierarchy of parts. In CVPR, 2007.
[11] R. Glantz, M. Pelillo, and W. G. Kropatsch. Matching segmentation hierarchies. Int. J. Pattern Rec. Artificial Intelligence, 18(3):397-424, 2004.
[12] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, vol. 2, pp. 2145-2152, 2006.
[13] A. Levinshtein, C. Sminchisescu, and S. Dickinson. Learning hierarchical shape models from examples. In EMMCVPR, Springer LNCS, vol. 3757, pp. 251-267, 2005.
[14] B. Ommer and J. M. Buhmann. Learning the compositional nature of visual objects. In CVPR, 2007.
[15] M. Pelillo, K. Siddiqi, and S. W. Zucker. Matching hierarchical structures using association graphs. IEEE TPAMI, 21(11):1105-1120, 1999.
[16] B. C. Russell, A. A. Efros, J. Sivic, W. T. Freeman, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, vol. 2, pp. 1605-1612, 2006.
[17] T. B. Sebastian, P. N. Klein, and B. B. Kimia. Recognition of shapes by editing their shock graphs. IEEE TPAMI, 26(5):550-571, 2004.
[18] A. Shokoufandeh, L. Bretzner, D. Macrini, M. Demirci, C. Jonsson, and S. Dickinson. The representation and matching of categorical shape. Computer Vision and Image Understanding, 103(2):139-154, 2006.
[19] K. Siddiqi, A. Shokoufandeh, S. J. Dickinson, and S. W. Zucker. Shock graphs and shape matching. Int. J. Comput. Vision, 35(1):13-32, 1999.
[20] S. Todorovic and N. Ahuja. Extracting subimages of an unknown category from a set of images. In CVPR, vol. 1, pp. 927-934, 2006.
[21] A. Torsello and E. R. Hancock. Computing approximate tree edit distance using relaxation labeling. Pattern Recogn. Lett., 24(8):1089-1097, 2003.
[22] J. Winn and N. Jojic. LOCUS: learning object classes with unsupervised segmentation. In ICCV, vol. 1, pp. 756-763, 2005.







[Figure: simultaneous recognition and segmentation of the category "car" on the LabelMe dataset; columns show the original image, the ST result, and the CST result.]