Visual Grouping and Recognition Jitendra Malik University of California at Berkeley Jitendra Malik...

Preview:

Citation preview

Visual Grouping and RecognitionVisual Grouping and Recognition

Jitendra Malik

University of California at Berkeley

Jitendra Malik

University of California at Berkeley

CollaboratorsCollaborators

• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)

• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren

• Recognition: Serge Belongie, Jan Puzicha

• Grouping: Jianbo Shi (CMU), Serge Belongie, Thomas Leung (Compaq CRL)

• Ecological Statistics: Charless Fowlkes, David Martin, Xiaofeng Ren

• Recognition: Serge Belongie, Jan Puzicha

From images to objectsFrom images to objects

Labeled sets: tiger, grass etc

What enables us to parse a scene?What enables us to parse a scene?

– Low level cues• Color/texture• Contours• Motion

– Mid level cues• T-junctions• Convexity

– High level Cues• Familiar Object• Familiar Motion

– Low level cues• Color/texture• Contours• Motion

– Mid level cues• T-junctions• Convexity

– High level Cues• Familiar Object• Familiar Motion

Grouping factors Grouping factors

But is segmentation a meaningful problem?

But is segmentation a meaningful problem?

• Difficult to define formally, but humans are remarkably consistent…

• Difficult to define formally, but humans are remarkably consistent…

Human Segmentations (1)Human Segmentations (1)

Human Segmentations (2)Human Segmentations (2)

ConsistencyConsistencyA

B C

• A,C are refinements of B• A,C are mutual refinements • A,B,C represent the same percept

• Attention accounts for differences

Image

BG L-bird R-bird

grass bush

headeye

beakfar body

headeye

beak body

Perceptual organization forms a tree:

Two segmentations are consistent when they can beexplained by the samesegmentation tree (i.e. theycould be derived from a single perceptual organization).

Ecological Statistics of image segmentation

Ecological Statistics of image segmentation

• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)

• Design algorithm for incorporating multiple cues for image segmentation

• Measure the conditional probability distribution of various grouping cues in human segmented images (Brunswik 1950)

• Design algorithm for incorporating multiple cues for image segmentation

ProximityProximity

Similarity of brightness(cf. Coughlan & Yuille, Geman & Jedynek)

Similarity of brightness(cf. Coughlan & Yuille, Geman & Jedynek)

ConvexityConvexity

Region AreaRegion Area

• Compare to Alvarez,Gousseau,Morel• Compare to Alvarez,Gousseau,Morel

y = Kx-

= 1.008

Lengths of curvesLengths of curves

100 120 140 160 180 200 220 240140

150

160

170

180

190

200

210

220

230

240

-1 0 1 2 3 4 5 6 70

1

2

3

4

5

6

7

8

9

log(contour length)

log(c

ount)

Image Segmentation as Graph PartitioningImage Segmentation as Graph PartitioningBuild a weighted graph G=(V,E) from image

V: image pixels

E: connections between pairs of nearby pixels

region

same the tobelong

j& iy that probabilit :ijW

Partition graph so that similarity within group is large and similarity between groups is small -- Normalized Cuts [Shi&Malik 97]

Normalized Cut, A measure of dissimilarity

Normalized Cut, A measure of dissimilarity

• Minimum cut is not appropriate since it favors cutting small pieces.

• Normalized Cut, Ncut:

• Minimum cut is not appropriate since it favors cutting small pieces.

• Normalized Cut, Ncut:

V),(

B)A,(

V)A,(

B)A,( B)A,(

Bassoc

cut

assoc

cutNcut

Normalized Cut As Generalized Eigenvalue problem

Normalized Cut As Generalized Eigenvalue problem

• after simplification, we get• after simplification, we get

...

),(

),( ;

11)1(

)1)(()1(

11

)1)(()1(

)VB,(

)BA,(

)VA,(

B)A,(B)A,(

0

i

x

T

T

T

T

iiD

iiDk

Dk

xWDx

Dk

xWDx

assoc

cut

assoc

cutNcut

i

.01},,1{ with ,)(

),(

DybyDyy

yWDyBANcut T

iT

T

Cue-Integration for Image SegmentationCue-Integration for Image Segmentation

[Malik, Belongie, Shi, Leung 1999]

On image segmentation..On image segmentation..

• Humans are quite consistent, so model the goal as emulating their behavior.

• Ecological statistics of grouping cues can be learned from image data.

• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.

• Humans are quite consistent, so model the goal as emulating their behavior.

• Ecological statistics of grouping cues can be learned from image data.

• We now have a generic image segmentation algorithm (code available) which can be applied for MPEG-4/7 compression and object recognition.

Framework for RecognitionFramework for Recognition(1) Segmentation

PixelsSegments(2) Association

SegmentsRegions(3) Matching

RegionsPrototypes

Over-segmentation necessary; Under-segmentation fatal

Enumerate: # of size k regions in image with n segments is ~(4**k)*n/k

~10 views/object. Matching tolerant to pose/illumination changes, intra-category variation, error in previous steps

Matching regions to viewsMatching regions to views

• GOAL: obtain small misclassification error using few views

• Matching allowing deformations of prototype views makes this possible

• GOAL: obtain small misclassification error using few views

• Matching allowing deformations of prototype views makes this possible

Matching with original and deformed prototypesPrototype Test Error

Deforming Biological ShapesDeforming Biological Shapes

• D’Arcy Thompson: On Growth and Form, 1917– studied transformations between shapes of organisms

• Find correspondences between points on shape

• Estimate transformation

• Measure similarity

model target

...

Finding correspondences between shapes

Finding correspondences between shapes

• Each shape is represented by a set of sample points

• Each sample point has a descriptor – the shape context

• Define cost Wij for matching point i on first shape with point j on second shape.

• Solve for correspondence as optimum assignment.

• Each shape is represented by a set of sample points

• Each sample point has a descriptor – the shape context

• Define cost Wij for matching point i on first shape with point j on second shape.

• Solve for correspondence as optimum assignment.

Shape ContextShape ContextCount the number of points inside each bin, e.g.:

Count = 4

Count = 10

...

Compact representation of distribution of points relative to each point

Comparing Shape ContextsComparing Shape ContextsCompute matching costs using Chi Squared Test:

Recover correspondences by solving linear assignment problem with costs Cij

[Jonker & Volgenant 1987]

MatchingExampleMatchingExample

model target

Synthetic Test ResultsSynthetic Test ResultsFish - deformation + noise Fish - deformation + outliers

ICP Shape Context RPM

Measuring Shape SimilarityMeasuring Shape Similarity• Image appearance around matched points

– color or gray-level window– orientation

• Shape context differences at matched points

• Bending Energy

• Image appearance around matched points– color or gray-level window– orientation

• Shape context differences at matched points

• Bending Energy

COIL Object Database

Editing: PrototypesEditing: Prototypes

• Human Shape Perception• Computational Needs for K-NN

• Human Shape Perception• Computational Needs for K-NN

Prototype Selection: Coil-20Prototype Selection: Coil-20

MNIST Handwritten Digits

Handwritten Digit RecognitionHandwritten Digit Recognition• MNIST 60 000:

– linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 60 000: – linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 600 000 (distortions): – LeNet 5: 0.8%– SVM: 0.8%– Boosted LeNet 4: 0.7%

• MNIST 20 000: – K-NN, Shape Context

matching: 0.63%

Hand-written Digit RecognitionHand-written Digit Recognition• MNIST 60 000:

– linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 60 000: – linear: 12.0%

– 40 PCA+ quad: 3.3%

– 1000 RBF +linear: 3.6%

– K-NN: 5%

– K-NN (deskewed): 2.4%

– K-NN (tangent dist.): 1.1%

– SVM: 1.1%

– LeNet 5: 0.95%

• MNIST 600 000 (distortions): – LeNet 5: 0.8%

– SVM: 0.8%

– Boosted LeNet 4: 0.7%

• MNIST 20 000– K-NN, Shape context

matching: 0.63 %

Results: Digit RecognitionResults: Digit Recognition

1-NN classifier using:Shape context + 0.3 * bending + 1.6 * image appearance

1-NN classifier using:Shape context + 0.3 * bending + 1.6 * image appearance

Results: Digit Recognition (Detail)

Results: Digit Recognition (Detail)

Trademark SimilarityTrademark Similarity

Future work..Future work..

• Indexing based on color/texture/shape features before correspondence matching

• Integrate segmentation and recognition

• Indexing based on color/texture/shape features before correspondence matching

• Integrate segmentation and recognition

Computing cost on a Pentium PCComputing cost on a Pentium PC

• Segmentation: 2 minutes /image (200x100)

• Matching : 0.2 sec / match (100 points)

• Segmentation: 2 minutes /image (200x100)

• Matching : 0.2 sec / match (100 points)

Given a 104 speedup..Given a 104 speedup..

• 5K object categories/sec

• Humans can recognize 10K -100K objects, so we could be in the ballpark of human level vision by 2020.

• 5K object categories/sec

• Humans can recognize 10K -100K objects, so we could be in the ballpark of human level vision by 2020.

Recommended