
2D Human Pose Estimation in TV Shows
Vittorio Ferrari, Manuel Marin, Andrew Zisserman
Dagstuhl Seminar, July 2008


Goal: detect people and estimate 2D pose in images/video
Pose: spatial configuration of body parts
Estimation: localize body parts in the image
Desired scope: TV shows and feature films; focus on the upper body

The search space of pictorial structures is large
Body parts: fixed aspect-ratio rectangles for head, torso, upper and lower arms = 4 parameters each (x, y, θ, s)
[Ramanan and Forsyth; Mori et al.; Sigal and Black; Felzenszwalb and Huttenlocher; Bergtholdt et al.]
Search space
- 4P dimensions (a scale factor per part), or 3P+1 dimensions (a single scale factor)
- P = 6 for the upper body
- a 720×405 image gives ~10^45 configurations!
Kinematic constraints
- reduce the space to valid body configurations (~10^28)
- exploit the model's independencies for efficient inference (~10^12)
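To make these orders of magnitude concrete, here is the back-of-the-envelope count as a minimal Python sketch; the orientation and scale bin counts are illustrative assumptions, not taken from the slides:

```python
# Naive search-space size for a pictorial structure with P parts,
# each parameterized by (x, y, theta, s).
W, H = 720, 405        # image size from the slides
P = 6                  # upper-body parts
N_THETA = 24           # assumed number of orientation bins
N_SCALE = 10           # assumed number of scale bins

states_per_part = W * H * N_THETA * N_SCALE   # ~7e7 states per part
naive = states_per_part ** P                  # ~1e47: the slides' ballpark

# With a tree-structured kinematic prior and distance transforms
# (Felzenszwalb and Huttenlocher), exact inference becomes linear in
# the number of part states instead of exponential in P:
tree_dp = P * states_per_part                 # ~4e8 operations
print(f"naive: {naive:.1e} configs, tree DP: {tree_dp:.1e} ops")
```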

The challenge of unconstrained imagery
varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing
Extremely difficult when nothing is known about appearance, pose, or location

Approach overview
- Human detection and foreground highlighting: reduce the search space (new additions)
- Image parsing: estimate pose, every frame independently (Ramanan NIPS 2006)
- Learn appearance models over multiple frames, then spatio-temporal parsing: multiple frames jointly (new additions)

Single frame

Search space reduction by human detection
Idea: get an approximate location and scale with a detector generic over pose and appearance
Benefits for pose estimation
+ fixes the scale of body parts
+ sets bounds on (x, y) locations
+ also detects back views
- little information about pose (arms)
Building an upper-body detector
- based on Dalal and Triggs CVPR 2005
- training set: 96 frames × 12 perturbations (no Buffy data)
+ fast

Search space reduction by human detection
Upper body detection and temporal association [video]
Code available: www.robots.ox.ac.uk/~vgg/software/UpperBody/index.html
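As a rough illustration of the Dalal-and-Triggs-style window classifier such a detector builds on, here is a minimal sketch using scikit-image HOG features and a linear SVM; the window size, feature parameters, and training-data preparation are assumptions, not the authors' actual settings:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN = (96, 96)  # assumed upper-body window size (illustrative)

def hog_features(window):
    """Dalal-Triggs-style HOG descriptor for one grayscale window."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_detector(pos, neg):
    """pos/neg: lists of grayscale windows, e.g. annotated upper bodies
    plus small perturbations, and background crops (hypothetical data)."""
    X = np.array([hog_features(w) for w in pos + neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    return LinearSVC(C=0.01).fit(X, y)

def detect(clf, image, stride=8, thresh=0.5):
    """Slide the window over one scale; rescan rescaled images for more."""
    hits = []
    for yy in range(0, image.shape[0] - WIN[0], stride):
        for xx in range(0, image.shape[1] - WIN[1], stride):
            f = hog_features(image[yy:yy + WIN[0], xx:xx + WIN[1]])
            score = clf.decision_function([f])[0]
            if score > thresh:
                hits.append((xx, yy, score))
    return hits
```

In practice the image is rescanned at multiple scales and overlapping hits are merged by non-maximum suppression before temporal association.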

Search space reduction by foreground highlighting
Idea: exploit knowledge about the structure of the search area to initialize GrabCut (Rother et al. 2004)
Initialization
- learn fg/bg models from regions where the person is likely present/absent
- clamp the central strip to foreground
- don't clamp the background (arms can be anywhere)
Benefits for pose estimation
+ further reduces clutter
+ conservative (no loss of parts 95.5% of the time)
+ needs no knowledge of the background
+ allows for a moving background
[figure: initialization → output]

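A minimal sketch of this kind of GrabCut initialization with OpenCV, seeded from an upper-body detection box; the margins and strip geometry are illustrative assumptions, not the exact regions used in the work:

```python
import numpy as np
import cv2

def highlight_foreground(img, det):
    """Run GrabCut seeded from a detection box det = (x, y, w, h).

    img: BGR uint8 image. Everything starts as 'probably background'
    (never clamped, so arms anywhere can be recovered), the enlarged
    detection area is 'probably foreground', and a central head/torso
    strip is clamped to foreground.
    """
    x, y, w, h = det
    mask = np.full(img.shape[:2], cv2.GC_PR_BGD, np.uint8)

    # enlarged detection area: probably foreground (margin is assumed)
    m = w // 2
    mask[max(0, y - m):y + h + m, max(0, x - m):x + w + m] = cv2.GC_PR_FGD

    # central strip (head + torso): clamp to foreground
    mask[y + h // 4:y + h, x + 2 * w // 5:x + 3 * w // 5] = cv2.GC_FGD

    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```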

Pose estimation by image parsing (Ramanan NIPS 2006)
Goal: estimate the posterior of the part configuration

P(L \mid I) \propto \exp\Big( \sum_i \Phi(l_i) + \sum_{(i,j) \in E} \Psi(l_i, l_j) \Big)

Φ = image evidence (given edge/appearance models)
Ψ = spatial prior (kinematic constraints)
Algorithm
1. inference with Φ = edges
2. learn appearance models of body parts and the background
3. inference with Φ = edges + appearance
Advantages of space reduction
+ much more robust
+ much faster (10x-100x)
[figure: appearance models; edge parse; edge + appearance parse]

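For intuition, here is a minimal sketch of exact sum-product inference on a tree of discretized part states, the machinery behind such a parse; the potentials are hypothetical stand-ins, not Ramanan's actual edge or appearance features:

```python
import numpy as np

def tree_marginals(phi, psi, parent):
    """Exact sum-product on a tree-structured pictorial structure.

    phi[i]: (S,) unary potential of part i over S discrete states
    psi[i]: (S, S) pairwise potential, psi[i][a, b] for parent state a
            and child state b (defined for i >= 1)
    parent: parent[i] is the parent of part i; part 0 is the root and
            parts are assumed topologically ordered (parent < child)
    Returns the per-part marginals of P(L | I).
    """
    n = len(phi)
    up = [np.asarray(p, dtype=float).copy() for p in phi]
    up_msg = [None] * n
    for i in range(n - 1, 0, -1):        # fold messages leaves -> root
        up_msg[i] = psi[i] @ up[i]
        up[parent[i]] = up[parent[i]] * up_msg[i]
    belief = [None] * n
    belief[0] = up[0]
    for i in range(1, n):                # distribute root -> leaves
        down = psi[i].T @ (belief[parent[i]] / up_msg[i])
        belief[i] = up[i] * down
    return [b / b.sum() for b in belief]
```

The two-pass parse then amounts to running this once with edge-based unaries, re-estimating part and background appearance models from the resulting soft marginals, and running it again with unaries combining edges and appearance.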

Failure of direct pose estimation: Ramanan NIPS 2006 unaided

Multiple frames

Transferring appearance models across frames
Idea: refine the parsing of difficult frames using appearance models from confident ones (exploit continuity of appearance)
Algorithm
1. select frames with low entropy of P(L | I)
2. integrate their appearance models by minimizing KL-divergence
3. re-parse every frame using the integrated appearance models
Advantages
+ improves the parse in difficult frames
+ better than Ramanan CVPR 2005: richer and more robust models; needs no specific pose at any point in the video
[figure: lowest-entropy frames → higher-entropy frame]

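A minimal sketch of steps 1-2, assuming each frame's parse yields per-part posterior maps and per-part color histograms as appearance models; since the distribution q minimizing Σ_i KL(p_i ‖ q) over a fixed support is the arithmetic mean of the p_i, the min-KL integration reduces to averaging:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a (flattened, normalized) posterior map."""
    p = p.ravel() / p.sum()
    nz = p[p > 0]
    return -(nz * np.log(nz)).sum()

def select_confident(frame_posteriors, k=5):
    """Pick the k frames whose part posteriors have the lowest total
    entropy. frame_posteriors: list (frames) of lists (parts) of maps."""
    scores = [sum(entropy(p) for p in parts) for parts in frame_posteriors]
    return np.argsort(scores)[:k]

def integrate_models(histograms):
    """Merge one part's appearance histograms from the selected frames.
    The mean minimizes sum_i KL(p_i || q), so averaging is exactly the
    min-KL-divergence integration."""
    h = np.mean(histograms, axis=0)
    return h / h.sum()
```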

The repulsive model
Idea: extend the kinematic tree with edges preferring non-overlapping left/right arms
Model
- add repulsive edges between the left/right upper arms and the left/right lower arms
- inference with Loopy Belief Propagation
Advantage
+ less double-counting
[graph: nodes l_H, l_T, l_LUA, l_RUA, l_LLA, l_RLA; Ψ edges = kinematic constraints; Λ edges = repulsive prior]

P(L \mid I) \propto \exp\Big( \sum_i \Phi(l_i) + \sum_{(i,j) \in E} \Psi(l_i, l_j) + \sum_{k \in \{U, L\}} \Lambda(l_{LkA}, l_{RkA}) \Big)
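One plausible form of the repulsive prior Λ is an overlap penalty between corresponding left/right arm rectangles; the slides do not give the actual functional form, so this sketch is an assumption:

```python
import numpy as np

def rect_mask(shape, x, y, theta, s, base_w=14, base_h=40):
    """Rasterize a part rectangle with state (x, y, theta, s) into a
    boolean mask; base_w/base_h fix the part's aspect ratio (assumed)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    c, si = np.cos(theta), np.sin(theta)
    u = c * (xx - x) + si * (yy - y)      # part-aligned coordinates
    v = -si * (xx - x) + c * (yy - y)
    return (np.abs(u) < s * base_w / 2) & (np.abs(v) < s * base_h / 2)

def repulsive_log_potential(shape, l_left, l_right, alpha=3.0):
    """log Λ(l_L, l_R): penalize image overlap of left/right arms.
    Zero when the rectangles are disjoint, increasingly negative as
    they coincide, which discourages double-counting one arm."""
    a = rect_mask(shape, *l_left)
    b = rect_mask(shape, *l_right)
    union = np.logical_or(a, b).sum()
    iou = np.logical_and(a, b).sum() / union if union else 0.0
    return -alpha * iou
```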

Spatio-temporal parsing
Transferring appearance models exploits continuity of appearance; after the repulsive model, now exploit continuity of position.
[pipeline recap: image parsing → learn appearance models over multiple frames → spatio-temporal parsing]

Spatio-temporal parsing
Idea: extend the model to a spatio-temporal block
Model
- add temporal edges between corresponding parts in adjacent frames
- inference with Loopy Belief Propagation
[graph: nodes l_H^t, l_T^t, l_LA^t, l_RA^t for frames t = 1…n; Ψ = spatial edges, Ω = temporal edges]
Φ = image evidence (given edge + appearance models)
Ψ = spatial prior (kinematic constraints)
Ω = temporal prior (continuity constraints)
Λ = repulsive prior (anti double-counting)

Spatio-temporal parsing
Integrated spatio-temporal parsing reduces uncertainty
[figure: spatio-temporal vs. no spatio-temporal]
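A minimal sketch of a temporal prior Ω in this spirit: a Gaussian continuity penalty between the same part's state in adjacent frames (the actual form and bandwidths are assumptions):

```python
import numpy as np

def temporal_log_potential(l_t, l_t1, sigma_xy=10.0, sigma_theta=0.3):
    """log Omega(l_t, l_{t+1}) for one part in adjacent frames.
    States are (x, y, theta, s); prefers small motion and rotation."""
    dx, dy = l_t1[0] - l_t[0], l_t1[1] - l_t[1]
    dtheta = np.arctan2(np.sin(l_t1[2] - l_t[2]),
                        np.cos(l_t1[2] - l_t[2]))   # wrap to [-pi, pi]
    return (-(dx ** 2 + dy ** 2) / (2 * sigma_xy ** 2)
            - dtheta ** 2 / (2 * sigma_theta ** 2))
```

One such edge per part links every pair of adjacent frames in the block, and Loopy Belief Propagation handles the resulting cycles.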

Full-body pose estimation in easier conditions
Weizmann action dataset (Blank et al. ICCV 2005) [videos]

Evaluation dataset
- >70,000 frames over 4 episodes of Buffy the Vampire Slayer (>1000 shots)
- uncontrolled and extremely challenging: low contrast, scale changes, moving camera and background, extensive clutter, any clothing, any pose

- figure overlays = final output, after spatio-temporal inference

Example estimated poses

Quantitative evaluation
Ground truth: 69 shots, 4 frames each = 276 frames; × 6 parts = 1656 annotated body parts (sticks)
Upper-body detector: fires on 243 of these frames (88%, 1458 body parts)
Data available: http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html

Quantitative evaluation: pose estimation accuracy
- 62.6% with repulsive model (… with spatio-temporal parsing)
- 59.4% with appearance transfer
- 57.9% with foreground highlighting (best single-frame result)
- 41.2% with detection
- 9.6% Ramanan NIPS 2006 unaided
Conclusions:
+ both reduction techniques improve results
± small improvement by appearance transfer …
+ … but the method also works well on static images
+ the repulsive model brings us further
= spatio-temporal parsing has a mostly aesthetic impact

Video!