
2D Human Pose Estimation in TV Shows
Vittorio Ferrari, Manuel Marin, Andrew Zisserman
Dagstuhl Seminar, July 2008


Goal: detect people and estimate 2D pose in images/video
Pose: spatial configuration of body parts
Estimation: localize body parts in the image
Desired scope: TV shows and feature films; focus on the upper body

The search space of pictorial structures is large
Body parts: fixed aspect-ratio rectangles for head, torso, upper and lower arms = 4 parameters each (x, y, θ, s)
[Ramanan and Forsyth; Mori et al.; Sigal and Black; Felzenszwalb and Huttenlocher; Bergtholdt et al.]
Search space
- 4P dimensions (a scale factor per part), or 3P+1 dimensions (a single scale factor)
- P = 6 for the upper body
- a 720×405 image gives ~10^45 configurations!
Kinematic constraints
- reduce the space to valid body configurations (~10^28)
- exploit the model's independencies for efficient inference (~10^12)
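To make these orders of magnitude concrete, here is the back-of-the-envelope count as a minimal Python sketch; the orientation and scale bin counts are illustrative assumptions, not taken from the slides:

```python
# Naive search-space size for a pictorial structure with P parts,
# each parameterized by (x, y, theta, s).
W, H = 720, 405        # image size from the slides
P = 6                  # upper-body parts
N_THETA = 24           # assumed number of orientation bins
N_SCALE = 10           # assumed number of scale bins

states_per_part = W * H * N_THETA * N_SCALE   # ~7e7 states per part
naive = states_per_part ** P                  # ~1e47: the slides' ballpark

# With a tree-structured kinematic prior and distance transforms
# (Felzenszwalb and Huttenlocher), exact inference becomes linear in
# the number of part states instead of exponential in P:
tree_dp = P * states_per_part                 # ~4e8 operations
print(f"naive: {naive:.1e} configs, tree DP: {tree_dp:.1e} ops")
```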

The challenge of unconstrained imagery
varying illumination and low contrast; moving camera and background; multiple people; scale changes; extensive clutter; any clothing
Extremely difficult when nothing is known about appearance, pose, or location

Approach overview
- Human detection and foreground highlighting: reduce the search space (new additions)
- Image parsing: estimate pose, every frame independently (Ramanan NIPS 2006)
- Learn appearance models over multiple frames, then spatio-temporal parsing: multiple frames jointly (new additions)

Single frame

Search space reduction by human detection
Idea: get an approximate location and scale with a detector generic over pose and appearance
Benefits for pose estimation
+ fixes the scale of body parts
+ sets bounds on (x, y) locations
+ also detects back views
- little information about pose (arms)
Building an upper-body detector
- based on Dalal and Triggs CVPR 2005
- training set: 96 frames × 12 perturbations (no Buffy data)
+ fast

Search space reduction by human detection
Upper body detection and temporal association [video]
Code available: www.robots.ox.ac.uk/~vgg/software/UpperBody/index.html
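As a rough illustration of the Dalal-and-Triggs-style window classifier such a detector builds on, here is a minimal sketch using scikit-image HOG features and a linear SVM; the window size, feature parameters, and training-data preparation are assumptions, not the authors' actual settings:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN = (96, 96)  # assumed upper-body window size (illustrative)

def hog_features(window):
    """Dalal-Triggs-style HOG descriptor for one grayscale window."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_detector(pos, neg):
    """pos/neg: lists of grayscale windows, e.g. annotated upper bodies
    plus small perturbations, and background crops (hypothetical data)."""
    X = np.array([hog_features(w) for w in pos + neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    return LinearSVC(C=0.01).fit(X, y)

def detect(clf, image, stride=8, thresh=0.5):
    """Slide the window over one scale; rescan rescaled images for more."""
    hits = []
    for yy in range(0, image.shape[0] - WIN[0], stride):
        for xx in range(0, image.shape[1] - WIN[1], stride):
            f = hog_features(image[yy:yy + WIN[0], xx:xx + WIN[1]])
            score = clf.decision_function([f])[0]
            if score > thresh:
                hits.append((xx, yy, score))
    return hits
```

In practice the image is rescanned at multiple scales and overlapping hits are merged by non-maximum suppression before temporal association.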

Search space reduction by foreground highlighting
Idea: exploit knowledge about the structure of the search area to initialize GrabCut (Rother et al. 2004)
Initialization
- learn fg/bg models from regions where the person is likely present/absent
- clamp the central strip to foreground
- don't clamp the background (arms can be anywhere)
Benefits for pose estimation
+ further reduces clutter
+ conservative (no loss of parts 95.5% of the time)
+ needs no knowledge of the background
+ allows for a moving background
[figure: initialization → output]

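A minimal sketch of this kind of GrabCut initialization with OpenCV, seeded from an upper-body detection box; the margins and strip geometry are illustrative assumptions, not the exact regions used in the work:

```python
import numpy as np
import cv2

def highlight_foreground(img, det):
    """Run GrabCut seeded from a detection box det = (x, y, w, h).

    img: BGR uint8 image. Everything starts as 'probably background'
    (never clamped, so arms anywhere can be recovered), the enlarged
    detection area is 'probably foreground', and a central head/torso
    strip is clamped to foreground.
    """
    x, y, w, h = det
    mask = np.full(img.shape[:2], cv2.GC_PR_BGD, np.uint8)

    # enlarged detection area: probably foreground (margin is assumed)
    m = w // 2
    mask[max(0, y - m):y + h + m, max(0, x - m):x + w + m] = cv2.GC_PR_FGD

    # central strip (head + torso): clamp to foreground
    mask[y + h // 4:y + h, x + 2 * w // 5:x + 3 * w // 5] = cv2.GC_FGD

    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```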

Pose estimation by image parsing (Ramanan NIPS 2006)
Goal: estimate the posterior of the part configuration

P(L \mid I) \propto \exp\Big( \sum_i \Phi(l_i) + \sum_{(i,j) \in E} \Psi(l_i, l_j) \Big)

Φ = image evidence (given edge/appearance models)
Ψ = spatial prior (kinematic constraints)
Algorithm
1. inference with Φ = edges
2. learn appearance models of body parts and the background
3. inference with Φ = edges + appearance
Advantages of space reduction
+ much more robust
+ much faster (10x-100x)
[figure: appearance models; edge parse; edge + appearance parse]

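For intuition, here is a minimal sketch of exact sum-product inference on a tree of discretized part states, the machinery behind such a parse; the potentials are hypothetical stand-ins, not Ramanan's actual edge or appearance features:

```python
import numpy as np

def tree_marginals(phi, psi, parent):
    """Exact sum-product on a tree-structured pictorial structure.

    phi[i]: (S,) unary potential of part i over S discrete states
    psi[i]: (S, S) pairwise potential, psi[i][a, b] for parent state a
            and child state b (defined for i >= 1)
    parent: parent[i] is the parent of part i; part 0 is the root and
            parts are assumed topologically ordered (parent < child)
    Returns the per-part marginals of P(L | I).
    """
    n = len(phi)
    up = [np.asarray(p, dtype=float).copy() for p in phi]
    up_msg = [None] * n
    for i in range(n - 1, 0, -1):        # fold messages leaves -> root
        up_msg[i] = psi[i] @ up[i]
        up[parent[i]] = up[parent[i]] * up_msg[i]
    belief = [None] * n
    belief[0] = up[0]
    for i in range(1, n):                # distribute root -> leaves
        down = psi[i].T @ (belief[parent[i]] / up_msg[i])
        belief[i] = up[i] * down
    return [b / b.sum() for b in belief]
```

The two-pass parse then amounts to running this once with edge-based unaries, re-estimating part and background appearance models from the resulting soft marginals, and running it again with unaries combining edges and appearance.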

Failure of direct pose estimation: Ramanan NIPS 2006 unaided

Multiple frames

Transferring appearance models across frames
Idea: refine the parsing of difficult frames using appearance models from confident ones (exploit continuity of appearance)
Algorithm
1. select frames with low entropy of P(L | I)
2. integrate their appearance models by minimizing KL-divergence
3. re-parse every frame using the integrated appearance models
Advantages
+ improves the parse in difficult frames
+ better than Ramanan CVPR 2005: richer and more robust models; needs no specific pose at any point in the video
[figure: lowest-entropy frames → higher-entropy frame]

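A minimal sketch of steps 1-2, assuming each frame's parse yields per-part posterior maps and per-part color histograms as appearance models; since the distribution q minimizing Σ_i KL(p_i ‖ q) over a fixed support is the arithmetic mean of the p_i, the min-KL integration reduces to averaging:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a (flattened, normalized) posterior map."""
    p = p.ravel() / p.sum()
    nz = p[p > 0]
    return -(nz * np.log(nz)).sum()

def select_confident(frame_posteriors, k=5):
    """Pick the k frames whose part posteriors have the lowest total
    entropy. frame_posteriors: list (frames) of lists (parts) of maps."""
    scores = [sum(entropy(p) for p in parts) for parts in frame_posteriors]
    return np.argsort(scores)[:k]

def integrate_models(histograms):
    """Merge one part's appearance histograms from the selected frames.
    The mean minimizes sum_i KL(p_i || q), so averaging is exactly the
    min-KL-divergence integration."""
    h = np.mean(histograms, axis=0)
    return h / h.sum()
```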

The repulsive model
Idea: extend the kinematic tree with edges preferring non-overlapping left/right arms
Model
- add repulsive edges between the left/right upper arms and the left/right lower arms
- inference with Loopy Belief Propagation
Advantage
+ less double-counting
[graph: nodes l_H, l_T, l_LUA, l_RUA, l_LLA, l_RLA; Ψ edges = kinematic constraints; Λ edges = repulsive prior]

P(L \mid I) \propto \exp\Big( \sum_i \Phi(l_i) + \sum_{(i,j) \in E} \Psi(l_i, l_j) + \sum_{k \in \{U, L\}} \Lambda(l_{LkA}, l_{RkA}) \Big)
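One plausible form of the repulsive prior Λ is an overlap penalty between corresponding left/right arm rectangles; the slides do not give the actual functional form, so this sketch is an assumption:

```python
import numpy as np

def rect_mask(shape, x, y, theta, s, base_w=14, base_h=40):
    """Rasterize a part rectangle with state (x, y, theta, s) into a
    boolean mask; base_w/base_h fix the part's aspect ratio (assumed)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    c, si = np.cos(theta), np.sin(theta)
    u = c * (xx - x) + si * (yy - y)      # part-aligned coordinates
    v = -si * (xx - x) + c * (yy - y)
    return (np.abs(u) < s * base_w / 2) & (np.abs(v) < s * base_h / 2)

def repulsive_log_potential(shape, l_left, l_right, alpha=3.0):
    """log Λ(l_L, l_R): penalize image overlap of left/right arms.
    Zero when the rectangles are disjoint, increasingly negative as
    they coincide, which discourages double-counting one arm."""
    a = rect_mask(shape, *l_left)
    b = rect_mask(shape, *l_right)
    union = np.logical_or(a, b).sum()
    iou = np.logical_and(a, b).sum() / union if union else 0.0
    return -alpha * iou
```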

Spatio-temporal parsing
Transferring appearance models exploits continuity of appearance; after the repulsive model, now exploit continuity of position.
[pipeline recap: image parsing → learn appearance models over multiple frames → spatio-temporal parsing]

Spatio-temporal parsing
Idea: extend the model to a spatio-temporal block
Model
- add temporal edges between corresponding parts in adjacent frames
- inference with Loopy Belief Propagation
[graph: nodes l_H^t, l_T^t, l_LA^t, l_RA^t for frames t = 1…n; Ψ = spatial edges, Ω = temporal edges]
Φ = image evidence (given edge + appearance models)
Ψ = spatial prior (kinematic constraints)
Ω = temporal prior (continuity constraints)
Λ = repulsive prior (anti double-counting)

Spatio-temporal parsing
Integrated spatio-temporal parsing reduces uncertainty
[figure: spatio-temporal vs. no spatio-temporal]
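A minimal sketch of a temporal prior Ω in this spirit: a Gaussian continuity penalty between the same part's state in adjacent frames (the actual form and bandwidths are assumptions):

```python
import numpy as np

def temporal_log_potential(l_t, l_t1, sigma_xy=10.0, sigma_theta=0.3):
    """log Omega(l_t, l_{t+1}) for one part in adjacent frames.
    States are (x, y, theta, s); prefers small motion and rotation."""
    dx, dy = l_t1[0] - l_t[0], l_t1[1] - l_t[1]
    dtheta = np.arctan2(np.sin(l_t1[2] - l_t[2]),
                        np.cos(l_t1[2] - l_t[2]))   # wrap to [-pi, pi]
    return (-(dx ** 2 + dy ** 2) / (2 * sigma_xy ** 2)
            - dtheta ** 2 / (2 * sigma_theta ** 2))
```

One such edge per part links every pair of adjacent frames in the block, and Loopy Belief Propagation handles the resulting cycles.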

Full-body pose estimation in easier conditions
Weizmann action dataset (Blank et al. ICCV 2005) [videos]

Evaluation dataset
- >70,000 frames over 4 episodes of Buffy the Vampire Slayer (>1000 shots)
- uncontrolled and extremely challenging: low contrast, scale changes, moving camera and background, extensive clutter, any clothing, any pose

- figure overlays = final output, after spatio-temporal inference

Example estimated poses

Quantitative evaluation
Ground truth: 69 shots, 4 frames each = 276 frames; × 6 parts = 1656 annotated body parts (sticks)
Upper-body detector: fires on 243 of these frames (88%, 1458 body parts)
Data available: http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html

Quantitative evaluation: pose estimation accuracy
- 62.6% with repulsive model (… with spatio-temporal parsing)
- 59.4% with appearance transfer
- 57.9% with foreground highlighting (best single-frame result)
- 41.2% with detection
- 9.6% Ramanan NIPS 2006 unaided
Conclusions:
+ both reduction techniques improve results
± small improvement by appearance transfer …
+ … but the method also works well on static images
+ the repulsive model brings us further
= spatio-temporal parsing has a mostly aesthetic impact

Video!