Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley

Visual Grouping and RecognitionVisual Grouping and Recognition

Jitendra Malik

University of California at Berkeley

Jitendra Malik

University of California at Berkeley

CollaboratorsCollaborators

• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)

• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren

• Recognition: Serge Belongie, Jan Puzicha

• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)

• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren

• Recognition: Serge Belongie, Jan Puzicha

From images to objectsFrom images to objects

Labeled sets: tiger, grass etc

What enables us to parse a scene?What enables us to parse a scene?

– Low level cues• Color/texture• Contours• Motion

– Mid level cues• T-junctions• Convexity

– High level Cues• Familiar Object• Familiar Motion

– Low level cues• Color/texture• Contours• Motion

– Mid level cues• T-junctions• Convexity

– High level Cues• Familiar Object• Familiar Motion

Grouping factors Grouping factors

But is segmentation a meaningful problem?

But is segmentation a meaningful problem?

• Difficult to define formally, but humans are remarkably consistent…

• Difficult to define formally, but humans are remarkably consistent…

Human Segmentations (1)Human Segmentations (1)

Human Segmentations (2)Human Segmentations (2)

ConsistencyConsistencyA

B C

• A,C are refinements of B• A,C are mutual refinements • A,B,C represent the same percept

• Attention accounts for differences

Image

BG L-bird R-bird

grass bush

headeye

beakfar body

headeye

beak body

Perceptual organization forms a tree:

Two segmentations are consistent when they can beexplained by the samesegmentation tree (i.e. theycould be derived from a single perceptual organization).

Ecological Statistics of image segmentation

Ecological Statistics of image segmentation

• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)

• Design algorithm for incorporating multiple cues for image segmentation

• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)

• Design algorithm for incorporating multiple cues for image segmentation

ProximityProximity

Similarity of brightness(cf. Coughlan & Yuille, Geman & Jedynek)

Similarity of brightness(cf. Coughlan & Yuille, Geman & Jedynek)

ConvexityConvexity

Region AreaRegion Area

• Compare to Alvarez,Gousseau,Morel• Compare to Alvarez,Gousseau,Morel

y = Kx-

= 1.008

Lengths of curvesLengths of curves

100 120 140 160 180 200 220 240140

150

160

170

180

190

200

210

220

230

240

-1 0 1 2 3 4 5 6 70

1

2

3

4

5

6

7

8

9

log(contour length)

log(c

ount)

Image Segmentation as Graph PartitioningImage Segmentation as Graph PartitioningBuild a weighted graph G=(V,E) from image

V: image pixels

E: connections between pairs of nearby pixels

region

same the tobelong

j& iy that probabilit :ijW

Partition graph so that similarity within group is large and similarity between groups is small -- Normalized Cuts [Shi&Malik 97]

Normalized Cut, A measure of dissimilarity

Normalized Cut, A measure of dissimilarity

• Minimum cut is not appropriate since it favors cutting small pieces.

• Normalized Cut, Ncut:

• Minimum cut is not appropriate since it favors cutting small pieces.

• Normalized Cut, Ncut:

V),(

B)A,(

V)A,(

B)A,( B)A,(

Bassoc

cut

assoc

cutNcut

Normalized Cut As Generalized Eigenvalue problem

Normalized Cut As Generalized Eigenvalue problem

• after simplification, we get• after simplification, we get

...

),(

),( ;

11)1(

)1)(()1(

11

)1)(()1(

)VB,(

)BA,(

)VA,(

B)A,(B)A,(

0

i

x

T

T

T

T

iiD

iiDk

Dk

xWDx

Dk

xWDx

assoc

cut

assoc

cutNcut

i

.01},,1{ with ,)(

),(

DybyDyy

yWDyBANcut T

iT

T

Cue-Integration for Image SegmentationCue-Integration for Image Segmentation

[Malik, Belongie, Shi, Leung 1999]

On image segmentation..On image segmentation..

• Humans are quite consistent, so model the goal as emulating their behavior.

• Ecological statistics of grouping cues can be learned from image data.

• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.

• Humans are quite consistent, so model the goal as emulating their behavior.

• Ecological statistics of grouping cues can be learned from image data.

• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.

Framework for RecognitionFramework for Recognition(1) Segmentation

PixelsSegments(2) Association

SegmentsRegions(3) Matching

RegionsPrototypes

Over-segmentation necessary; Under-segmentation fatal

Enumerate: # of size k regions in image with n segments is ~(4**k)*n/k

~10 views/object. Matching tolerant to pose/illumination changes, intra-category variation, error in previous steps

Matching regions to viewsMatching regions to views

• GOAL: obtain small misclassification error using few views

• Matching allowing deformations of prototype views makes this possible

• GOAL: obtain small misclassification error using few views

• Matching allowing deformations of prototype views makes this possible

Matching with original and deformed prototypesPrototype Test Error

Deforming Biological ShapesDeforming Biological Shapes

• D’Arcy Thompson: On Growth and Form, 1917– studied transformations between shapes of organisms

• Find correspondences between points on shape

• Estimate transformation

• Measure similarity

model target

...

Finding correspondences between shapes

Finding correspondences between shapes

• Each shape is represented by a set of sample points

• Each sample point has a descriptor – the shape context

• Define cost Wij for matching point i on first shape with point j on second shape.

• Solve for correspondence as optimum assignment.

• Each shape is represented by a set of sample points

• Each sample point has a descriptor – the shape context

• Define cost Wij for matching point i on first shape with point j on second shape.

• Solve for correspondence as optimum assignment.

Shape ContextShape ContextCount the number of points inside each bin, e.g.:

Count = 4

Count = 10

...

Compact representation of distribution of points relative to each point

Comparing Shape ContextsComparing Shape ContextsCompute matching costs using Chi Squared Test:

Recover correspondences by solving linear assignment problem with costs Cij

[Jonker & Volgenant 1987]

MatchingExampleMatchingExample

model target

Synthetic Test ResultsSynthetic Test ResultsFish - deformation + noise Fish - deformation + outliers

ICP Shape Context RPM

Measuring Shape SimilarityMeasuring Shape Similarity• Image appearance around matched points

– color or gray-level window– orientation

• Shape context differences at matched points

• Bending Energy

• Image appearance around matched points– color or gray-level window– orientation

• Shape context differences at matched points

• Bending Energy

COIL Object Database

Editing: PrototypesEditing: Prototypes

• Human Shape Perception• Computational Needs for K-NN

• Human Shape Perception• Computational Needs for K-NN

Prototype Selection: Coil-20Prototype Selection: Coil-20

MNIST Handwritten Digits

Handwritten Digit RecognitionHandwritten Digit Recognition• MNIST 60 000:

– linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 60 000: – linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%



– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 600 000 (distortions): – LeNet 5: 0.8%– SVM: 0.8%– Boosted LeNet 4: 0.7%

• MNIST 20 000: – K-NN, Shape Context

matching: 0.63%

Hand-written Digit RecognitionHand-written Digit Recognition• MNIST 60 000:

– linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%



– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 60 000: – linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%



– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 600 000 (distortions): – LeNet 5: 0.8%

– SVM: 0.8%

– Boosted LeNet 4: 0.7%

• MNIST 20 000– K-NN, Shape context

matching: 0.63 %

Results: Digit RecognitionResults: Digit Recognition

1-NN classifier using:Shape context + 0.3 * bending + 1.6 * image appearance

1-NN classifier using:Shape context + 0.3 * bending + 1.6 * image appearance

Results: Digit Recognition (Detail)

Results: Digit Recognition (Detail)

Trademark SimilarityTrademark Similarity

Future work..Future work..

• Indexing based on color/texture/shape features before correspondence matching

• Integrate segmentation and recognition

• Indexing based on color/texture/shape features before correspondence matching

• Integrate segmentation and recognition

Computing cost on a Pentium PCComputing cost on a Pentium PC

• Segmentation: 2 minutes /image (200x100)

• Matching : 0.2 sec / match (100 points)

• Segmentation: 2 minutes /image (200x100)

• Matching : 0.2 sec / match (100 points)

Given a 104 speedup..Given a 104 speedup..

• 5K object categories/sec

• Humans can recognize 10K -100K objects, so we could be in the ballpark of human level vision by 2020.

• 5K object categories/sec

• Humans can recognize 10K -100K objects, so we could be in the ballpark of human level vision by 2020.

Documents

Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik University of California at Berkeley