
Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces

Carl Vondrick, Deva Ramanan, Donald Patterson
Department of Computer Science, University of California, Irvine

“Maybe it’s more bizarre that I keep doing these hits for a penny. I must not be the only one who finds them oddly compelling; more and more boxes show up on each hit.” — Anonymous subject

Unsolved Problem 1: No fully automatic solution today is capable of detecting and tracking both the people and the basketball.

Unsolved Problem 2: Building large video data sets is inefficient because frame-by-frame hand labeling is slow, costly, and tedious.

Motivations

1. What is the best division of labor for crowdsourced video labeling?

2. What are the tradeoffs between automation and manual labeling?

3. Given a fixed budget, what is the best accuracy we can achieve?

Contributions

• A set of “best practices” for crowdsourced video annotation.

• In contrast to [1], can interpolate nonlinear paths without much effort.

• Expanding [2] to analyze tradeoffs between human and CPU cost.

• Ability to build massive video data sets under a budget.

• A reusable, open source video annotation platform for affordable research video labeling.

Mechanical Turk

• Mechanical Turk: an online, monetized, crowdsourced marketplace.

• Ideal for tasks that are hard for computers, but trivial for humans.

• Workers complete Human Intelligence Tasks (HITs) and we get results.

The “Turk Philosophy”

• Suggests completely replacing automation with human effort.

• For Images: annotate every object (highly successful) [3].

• For Video: hand label every frame (highly inefficient).

• Given the redundant yet dynamic nature of video, we need an approach that combines the computational power of the CPU with the superior vision capability of humans.

References

[1] Yuen, J., Russell, B., Liu, C., Torralba, A.: LabelMe video: Building a Video Database with Human Annotations. ICCV (2009)

[2] Vijayanarasimhan, S., Grauman, K.: What's It Going to Cost You?: Predicting Effort vs. Informativeness for Multi-Label Image Annotations. CVPR (2009)

[3] Sorokin, A., Forsyth, D.: Utility Data Annotation with Amazon Mechanical Turk. CVPR Workshop on Internet Vision (2008)

Interactive Video Player

• Browser video player that guides a worker to label an entity.

• First, instructs the user to draw a box around an item of interest.

• The user then adjusts the box when the video pauses on the next key frame.

• Video is extracted into frames, removing artifacts from the Flash video codec.

• Carefully manage frame caching to reduce bandwidth.

• Enables wider participation from platforms without Flash support.

Quality Assurance

• No quality guarantee: workers are motivated to finish quickly.

• Experiments indicate 35% of labels were poor (see below).

• Identify degenerate work through hand validation, statistical overlap, heuristic techniques, or the user-agent identification string.
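
As one way to read the statistical-overlap check, here is a minimal sketch (not the deployed system's code) that flags a worker's submission when its boxes rarely overlap a trusted reference path; the (x, y, width, height) box format and the 0.5 thresholds are assumptions.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def looks_degenerate(worker_path, reference_path, min_iou=0.5, min_agreement=0.5):
    """Flag a submission whose boxes rarely overlap a trusted reference path."""
    hits = sum(1 for w, r in zip(worker_path, reference_path) if iou(w, r) >= min_iou)
    return hits < min_agreement * len(reference_path)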

Dense Labeling Protocol

1. Worker instructed to annotate an unlabeled entity.

2. If initial frame fully labeled, advance to next key frame.

3. Worker instructed to label again; repeat (2) if there is still nothing to label.

4. If worker can track and work is not degenerate, add to video.

5. Else, if no new objects are discovered, vote to finish.

6. After enough votes, server stops spawning HITs for the video.
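
A minimal sketch of the server-side loop implied by this protocol; the class and method names are hypothetical, not the actual server's API.

class VideoJob:
    """Hypothetical per-video state for the dense labeling protocol."""

    def __init__(self, votes_to_finish=3):
        self.paths = []              # accepted annotation paths
        self.finish_votes = 0        # workers voting that nothing is left
        self.votes_to_finish = votes_to_finish

    def needs_hit(self):
        # Step 6: stop spawning HITs once enough workers vote to finish.
        return self.finish_votes < self.votes_to_finish

    def submit(self, path, degenerate, voted_done):
        if voted_done:
            self.finish_votes += 1   # Step 5: no new objects were found.
        elif not degenerate:
            self.paths.append(path)  # Step 4: add the tracked object to the video.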

[Diagram: workers (···) connecting to the Video Server Cloud.]

• Server written in Python 2.6 and Cython. Client in JavaScript.

• Entirely open source. Can deploy to clouds without license fees.

Linear Interpolation

• The simplest tracking approach is linear interpolation:

$b^{\mathrm{lin}}_t = \frac{T-t}{T}\, b_0 + \frac{t}{T}\, b_T \quad \text{for } 0 \le t \le T$

• But objects do not necessarily move linearly and can be chaotic.
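
A minimal sketch of the interpolation above, assuming each box is an (x, y, width, height) tuple; the function name is hypothetical.

def interpolate_linear(b0, bT, T):
    """Interpolate a box from b0 at frame 0 to bT at frame T, one box per frame."""
    return [tuple(((T - t) * c0 + t * cT) / float(T) for c0, cT in zip(b0, bT))
            for t in range(T + 1)]

This alone only works when motion between key frames is roughly constant, which motivates the templates and constrained tracking below.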

Discriminative Object Templates

• Extract both HOG and RGB histogram features from the foregrounds and backgrounds of the annotated frames:

$\phi_n(b_n) = \begin{bmatrix} \mathrm{HOG} \\ \mathrm{RGB} \end{bmatrix}, \qquad y_n \in \{-1, 1\}$

• Learn an SVM weight vector $w$ that minimizes a regularized hinge loss:

$w^* = \operatorname*{argmin}_w \; \frac{1}{2}\, w \cdot w + C \sum_{n=1}^{N} \max\bigl(0,\, 1 - y_n\, w \cdot \phi_n(b_n)\bigr)$

•Data is very complex. Simpler templates perform poorly.
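
A minimal sketch of the template-learning step; extract_hog_rgb is an assumed helper, and scikit-learn's LinearSVC stands in for whatever solver the authors used.

import numpy as np
from sklearn.svm import LinearSVC

def learn_template(examples, extract_hog_rgb, C=1.0):
    """Learn the SVM weight vector w from (frame, box, label) examples.

    extract_hog_rgb(frame, box) -> 1-D HOG+RGB feature vector (assumed helper);
    label is +1 for foreground boxes and -1 for background boxes.
    """
    X = np.array([extract_hog_rgb(frame, box) for frame, box, _ in examples])
    y = np.array([label for _, _, label in examples])
    svm = LinearSVC(C=C, loss="hinge", dual=True)  # hinge-loss objective above
    svm.fit(X, y)
    return svm.coef_.ravel()                       # the weight vector w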

Can you spot all the difficult objects?

Constrained Tracking

• Calculate a least-cost path between constrained endpoints:

$\operatorname*{argmin}_{b_{1:T}} \; \sum_{t=1}^{T} U_t(b_t) + P(b_t, b_{t-1})$

• Local cost is the SVM score plus the deviation from the linear path, truncated:

$U_t(b_t) = \min\bigl(-w \cdot \phi_t(b_t) + \alpha_1 \|b_t - b^{\mathrm{lin}}_t\|^2,\; \alpha_2\bigr)$

• Pairwise cost ensures the path is smooth and does not teleport:

$P(b_t, b_{t-1}) = \alpha_3 \|b_t - b_{t-1}\|^2$

• Dynamic programming efficiently solves the recursion:

$\mathrm{cost}_0(b_0) = U_0(b_0)$

$\mathrm{cost}_t(b_t) = U_t(b_t) + \min_{b_{t-1}} \bigl[\mathrm{cost}_{t-1}(b_{t-1}) + P(b_t, b_{t-1})\bigr]$

A year’s worth of experiments. Workers produced quality annotations throughout the 210,000-frame basketball game.

Experiments & Results

• Diminishing returns: each click has less impact than the previous one.

• The “Turk philosophy” is not efficient for video.

• There is a trade-off between CPU cost and human cost for maximum accuracy.

• A vision-based tracker can benefit video annotation.

• Interactive vision: with modest human effort, we can deploy algorithms that quantify progress in difficult scenarios.

[Figures: “Performance of interpolation and tracking” (average error per frame at 50% overlap vs. average clicks per frame) and “CPU vs Human” (error as a function of CPU cost and human cost), comparing dynamic programming against linear interpolation for Field Drills (easy), Basketball Players (intermediate), and Ball (difficult).]

[Figures: “Optimal tradeoff” (error vs. fraction of CPU cost at fixed budgets of $10 to $100) for Field Drills, Ball, Players, and Players (Moore’s law).]