Capturing Human Insight for Visual Learning
Kristen Grauman, Department of Computer Science, University of Texas at Austin
Work with Sudheendra Vijayanarasimhan, Adriana Kovashka, Devi Parikh, Prateek Jain, Sung Ju Hwang, and Jeff Donahue
Frontiers in Computer Vision Workshop, MIT, August 22, 2011




Page 1:

Capturing Human Insight for Visual Learning

Kristen Grauman, Department of Computer Science

University of Texas at Austin

Work with Sudheendra Vijayanarasimhan, Adriana Kovashka, Devi Parikh, Prateek Jain, Sung Ju Hwang,

and Jeff Donahue

Frontiers in Computer Vision Workshop, MIT, August 22, 2011

Page 2:

Problem: how to capture human insight about the visual world?

The complex space of visual objects, activities, and scenes.

[tiny image montage by Torralba et al.]

Annotator

• The point+label “mold” is restrictive
• Human effort is expensive

Page 3:

Problem: how to capture human insight about the visual world?

The complex space of visual objects, activities, and scenes.

[tiny image montage by Torralba et al.]

Annotator

Our approach:

Listen: explanations, comparisons, implied cues, …

Ask: actively learn

Page 4:

Deepening human communication with the system

What is this?

What property is changing here?

What’s worth mentioning?

Do you find him attractive? Why?

How do you know?

Is it ‘furry’?

Which is more ‘open’?

[Donahue & Grauman ICCV 2011; Hwang & Grauman BMVC 2010; Parikh & Grauman ICCV 2011, CVPR 2011; Kovashka et al. ICCV 2011]

Page 5:

Soliciting rationales

• We propose to ask the annotator not just what, but also why.

Is the team winning? How can you tell?
Is her form perfect? How can you tell?
Is it a safe route? How can you tell?

Page 6:

Soliciting rationales

Annotation task: Is her form perfect? How can you tell?

[Donahue & Grauman, ICCV 2011]

[Figure: “good form” vs. “bad form” examples. The annotator gives spatial rationales (marked image regions) and attribute rationales (e.g., pointed toes, balanced, knee angled, falling); each rationale yields a synthetic contrast example that influences the classifier.]

[Zaidan et al. HLT 2007]
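Below is a minimal sketch of how such rationales could be folded into training, in the spirit of the contrast-example formulation of Zaidan et al. (HLT 2007): each labeled example should outscore a synthetic contrast example built by suppressing the features the rationale points to. The helper names, the feature-zeroing step, and the pseudo-example encoding are illustrative assumptions, not the exact ICCV 2011 formulation.

```python
import numpy as np
from sklearn.svm import SVC

def make_contrast_example(x, rationale_dims):
    # Hypothetical helper: suppress the attribute / spatial-bin dimensions the
    # annotator's rationale pointed to, yielding a weakened copy of x.
    v = x.copy()
    v[rationale_dims] = 0.0
    return v

def fit_with_rationales(X, y, rationales, mu=1.0, C=1.0, contrast_weight=0.1):
    # Rationale constraint (Zaidan et al., HLT 2007): the original example should
    # beat its contrast example by margin mu.  Encoded here as extra
    # pseudo-examples (x - v) / mu with the same label, down-weighted.
    X_extra, y_extra, w_extra = [], [], []
    for i, dims in rationales.items():          # {example index: rationale dims}
        v = make_contrast_example(X[i], dims)
        X_extra.append((X[i] - v) / mu)
        y_extra.append(y[i])
        w_extra.append(contrast_weight)
    X_all = np.vstack([X, np.array(X_extra)]) if X_extra else X
    y_all = np.concatenate([y, y_extra]) if y_extra else y
    w_all = np.concatenate([np.ones(len(y)), w_extra]) if w_extra else None
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_all, y_all, sample_weight=w_all)
    return clf

# Toy usage: 20 examples, rationale dimensions given for a few of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 12))
y = np.array([1, -1] * 10)
clf = fit_with_rationales(X, y, rationales={0: [2, 3], 4: [7]})
```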

Page 7:

Rationale results

• Scene Categories: How can you tell the scene category?

• Hot or Not: What makes them hot (or not)?

• Public Figures: What attributes make them (un)attractive?

Collect rationales from hundreds of MTurk workers.

[Donahue & Grauman, ICCV 2011]

Page 8:

Rationale results

PubFig          Originals   + Rationales
Male            64.60%      68.14%
Female          51.74%      55.65%

Hot or Not      Originals   + Rationales
Male            54.86%      60.01%
Female          55.99%      57.07%

Scenes (mean AP)   Originals   + Rationales
Kitchen            0.1196      0.1395
Living Rm          0.1142      0.1238
Inside City        0.1299      0.1487
Coast              0.4243      0.4513
Highway            0.2240      0.2379
Bedroom            0.3011      0.3167
Street             0.0778      0.0790
Country            0.0926      0.0950
Mountain           0.1154      0.1158
Office             0.1051      0.1052
Tall Building      0.0688      0.0689
Store              0.0866      0.0867
Forest             0.3956      0.4006

[Donahue & Grauman, ICCV 2011]

Page 9:

• Issue: presence of objects != significance

• Our idea: Learn cross-modal representation that accounts for “what to mention”

Learning what to mention

Textual:
• Frequency
• Relative order
• Mutual proximity

Visual:
• Texture
• Scene
• Color, …

TAGS: Cow, Birds, Architecture, Water, Sky

Training: human-given descriptions

Birds, Architecture, Water, Cow, Sky, Tiles

Page 10:

Importance-aware semantic space

[Diagram: two views, View x and View y, mapped into a shared importance-aware semantic space.]

[Hwang & Grauman, BMVC 2010]

Learning what to mention
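A minimal sketch of how such an importance-aware space could be built, assuming plain linear CCA from scikit-learn as a stand-in for the kernelized CCA used in the BMVC 2010 work; the tag-importance encoding and the toy data below are illustrative, not the paper's exact features.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

vocab = {"cow": 0, "birds": 1, "architecture": 2, "water": 3, "sky": 4, "tiles": 5}

def tag_importance_features(ordered_tags):
    # Encode not only which tags are present but how humans gave them:
    # earlier-mentioned tags receive a larger "importance" weight.
    f = np.zeros(len(vocab))
    for rank, tag in enumerate(ordered_tags):
        f[vocab[tag]] = 1.0 / (rank + 1)
    return f

# Toy training data: visual descriptors paired with ordered human tag lists.
rng = np.random.default_rng(0)
tag_lists = [["cow", "birds", "water"], ["birds", "architecture", "sky"],
             ["water", "sky"], ["architecture", "tiles", "sky", "birds"]] * 5
X_visual = rng.normal(size=(len(tag_lists), 32))   # stand-in for texture/scene/color features
Y_tags = np.stack([tag_importance_features(t) for t in tag_lists])
Y_tags += 0.01 * rng.normal(size=Y_tags.shape)     # tiny jitter keeps the toy CCA well-behaved

# Learn a shared semantic space from the two views (visual, textual).
cca = CCA(n_components=2)
cca.fit(X_visual, Y_tags)

# At test time only the image is available: project its visual features into
# the learned space, where neighbors reflect what people tend to mention.
z_query = cca.transform(X_visual[:1])
```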

Page 11:

[Hwang & Grauman, BMVC 2010]

Learning what to mention: results

[Figure: qualitative results for a query image, comparing our method against words + visual and visual-only baselines.]

Page 12:

Problem: how to capture human insight about the visual world?

The complex space of visual objects, activities, and scenes.

[tiny image montage by Torralba et al.]

Annotator

Our approach:

Listen: explanations, comparisons, implied cues

Ask: actively learn

Page 13:

Traditional active learning

Unlabeled data

Labeled data

Current Model

Active Selection

Annotator

At each cycle, obtain a label for the most informative or uncertain example. [MacKay 1992, Freund et al. 1997, Tong & Koller 2001, Lindenbaum et al. 2004, Kapoor et al. 2007, …]

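As a concrete reference point, here is a minimal sketch of one such cycle using simple margin-based uncertainty sampling with a linear SVM (in the style of Tong & Koller 2001); the data and model choices are illustrative placeholders, not the specific criterion used in any one of the works above.

```python
import numpy as np
from sklearn.svm import SVC

def uncertainty_sampling_round(clf, X_labeled, y_labeled, X_unlabeled):
    # Train on the current labeled pool, then pick the unlabeled example the
    # model is least sure about (closest to the SVM decision boundary).
    clf.fit(X_labeled, y_labeled)
    margins = np.abs(clf.decision_function(X_unlabeled))
    return int(np.argmin(margins))      # index to send to the annotator

# Toy usage: the returned index is the next query posed to the annotator.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(10, 5))
y_lab = np.array([0, 1] * 5)
X_unlab = rng.normal(size=(200, 5))
query_idx = uncertainty_sampling_round(SVC(kernel="linear"), X_lab, y_lab, X_unlab)
```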

Page 14:

[Diagram: the same active-learning loop (unlabeled data, labeled data, current model, active selection, annotator), now with varying annotation costs ($, $$).]

• Annotation tasks vary in cost and informativeness (see the cost-aware sketch below)
• Multiple annotators working in parallel
• Massive unlabeled pools of data


[Vijayanarasimhan & Grauman NIPS 2008, CVPR 2009, Vijayanarasimhan et al. CVPR 2010, CVPR 2011, Kovashka et al. ICCV 2011]

Challenges in active visual learning
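One way to make the first challenge concrete: rank candidate annotation tasks by expected benefit per unit cost rather than by uncertainty alone. The sketch below is a deliberately simplified stand-in for the value-of-information criteria in the cited work; the task names, benefit scores, and costs are made up for illustration.

```python
import numpy as np

def cost_sensitive_selection(benefits, costs):
    # Pick the annotation task with the best expected benefit per unit cost.
    return int(np.argmax(np.asarray(benefits, dtype=float) / np.asarray(costs, dtype=float)))

# Hypothetical candidates: a cheap image tag can beat an expensive but highly
# informative full segmentation once cost is taken into account.
tasks = ["tag the image", "draw a bounding box", "full segmentation"]
benefits = [0.60, 0.80, 0.95]   # expected reduction in model uncertainty (made up)
costs = [1.0, 4.0, 25.0]        # expected annotation effort, e.g. relative seconds (made up)
print(tasks[cost_sensitive_selection(benefits, costs)])   # -> "tag the image"
```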

Page 15:

Sub-linear time active selection

[Jain, Vijayanarasimhan, Grauman, NIPS 2010]

We propose a novel hashing approach to identify the most uncertain examples in sub-linear time.

[Diagram: the current classifier's hyperplane and the unlabeled data are hashed into a hash table; binary bucket codes (e.g., 101, 110, 111) index the actively selected examples.]

For 4.5 million unlabeled instances, selection takes 10 minutes of machine time per iteration, vs. 60 hours for a naïve scan.
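A hedged sketch of what such a scheme can look like: a two-bit random-projection hash family in the spirit of hyperplane hashing, where unlabeled points are indexed once and the current classifier's hyperplane normal is used as the query, so points near the decision boundary tend to land in the query's bucket. The exact construction below is an illustrative reconstruction, not the authors' released code.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n, n_pairs = 64, 20_000, 3           # toy feature dim, pool size, hash bit pairs

U = rng.normal(size=(n_pairs, d))       # random directions u_k
V = rng.normal(size=(n_pairs, d))       # random directions v_k

def hash_point(z):
    # Database points: bits from sign tests against u_k and v_k.
    return tuple(np.concatenate([U @ z > 0, V @ z > 0]).astype(int))

def hash_hyperplane(w):
    # Hyperplane query: flip the sign on the v_k tests, so points nearly
    # perpendicular to w (i.e., near the decision boundary) are the most
    # likely to collide with the query bucket.
    return tuple(np.concatenate([U @ w > 0, -(V @ w) > 0]).astype(int))

# Index the unlabeled pool once; sub-linear selection amortizes this cost.
X_unlabeled = rng.normal(size=(n, d))
table = defaultdict(list)
for i, z in enumerate(X_unlabeled):
    table[hash_point(z)].append(i)

# Each active-learning round: probe with the current SVM normal w and only
# score the candidates in the colliding bucket instead of the whole pool.
w = rng.normal(size=d)
candidates = table[hash_hyperplane(w)]
if candidates:
    scores = np.abs(X_unlabeled[candidates] @ w)
    query_idx = candidates[int(np.argmin(scores))]
```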

Page 16:

Live active learning results on Flickr test set

Outperforms the status quo data collection approach

[Vijayanarasimhan & Grauman, CVPR 2011]

Page 17:

Summary

• Humans are not simply “label machines”

• Widen access to visual knowledge
– New forms of input, often requiring associated new learning algorithms

• Manage large-scale annotation efficiently
– Cost-sensitive active question asking

• Live learning: moving beyond canned datasets