DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model

Eldar Insafutdinov1, Leonid Pishchulin1, Bjoern Andres1, Mykhaylo Andriluka1,2, and Bernt Schiele1

1 Max Planck Institute for Informatics, Saarbrücken, Germany

2 Stanford University, Stanford, USA

Goal

• Multi-person pose estimation in monocular images

State of the Art

• DeepCut [5]: joint body part labeling and grouping

+ joint reasoning at the finest level of detail

– weak pairwise terms based on geometry only

– infeasible run-time: inference takes hours to complete

Contributions

• A deeper, stronger and faster multi-person model

+ “deeper”: strong part detectors based on ResNet [3]

+ “stronger”: novel image-conditioned pairwise terms

+ “faster”: dramatic speed-ups due to strong pairwise terms and incremental optimization

+ NEW: heuristic solver for real-time inference

Unary Terms

• deeper architectures based on Residual Networks [3]

• dilation and de-convolution reduce stride to 8 px

• intermediate supervision: add loss into mid-layers

• joint training of classification and regression tasks
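
The stride arithmetic behind the dilation and de-convolution bullet can be sketched as follows. The nominal 32 px stride of an unmodified backbone, the single dilated stage, and the 2x up-sampling factor are assumptions for illustration; the poster states only the resulting 8 px stride.

```python
def effective_stride(base_stride, dilated_stages, upsample_factor):
    """Output stride after dilating stages and up-sampling.

    Each stage whose stride-2 subsampling is replaced by dilation
    halves the stride; de-convolution divides it by its factor.
    """
    stride = base_stride // (2 ** dilated_stages)
    return stride // upsample_factor


# Assumed: 32 px backbone stride, one dilated stage, 2x de-convolution.
print(effective_stride(32, dilated_stages=1, upsample_factor=2))  # prints 8
```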

DeeperCut Overview

• Joint part labeling and grouping via 0/1 variables

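[Figure: pipeline from detection candidates through a dense graph and body part labeling (labeled body parts) to joint person clusters, with stages I to IV marked.]

\[
\min_{(x,y) \in X_{DC}} \;
\sum_{d \in D} \sum_{c \in C} \alpha_{dc}\, x_{dc}
\;+\;
\sum_{dd' \in \binom{D}{2}} \sum_{c,c' \in C} \beta_{dd'cc'}\, x_{dc}\, x_{d'c'}\, y_{dd'}
\]

• d: detection, c: part class

• $x_{dc} \in \{0, 1\}$: part labeling; $y_{dd'} \in \{0, 1\}$: part clustering

• $\alpha$, $\beta$: cost; $X_{DC}$: constraints

• overall problem: subset partitioning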
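
To make the objective concrete, the sketch below evaluates the cost of a given 0/1 labeling. The names `alpha`, `beta`, `x`, and `y` mirror the symbols of the objective; the dense NumPy layout and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
import itertools
import numpy as np


def deepercut_cost(alpha, beta, x, y):
    """Cost of a labeling.

    alpha: (D, C) unary costs; beta: (D, D, C, C) pairwise costs.
    x: (D, C) 0/1 part labels; y: (D, D) 0/1 same-cluster indicators.
    Only unordered pairs d < d' contribute to the pairwise sum.
    """
    D = x.shape[0]
    cost = float(np.sum(alpha * x))
    for d, e in itertools.combinations(range(D), 2):
        if y[d, e]:
            # x[d] and x[e] are indicator vectors, so this bilinear form
            # picks beta[d, e, c, c'] for the selected classes (c, c').
            cost += float(x[d] @ beta[d, e] @ x[e])
    return cost


# Toy instance: 2 detections, 2 part classes.
alpha = np.array([[-1.0, 0.5], [0.2, -2.0]])
beta = np.zeros((2, 2, 2, 2))
beta[0, 1, 0, 1] = -3.0          # reward for pairing class 0 with class 1
x = np.array([[1, 0], [0, 1]])   # d=0 is class 0, d=1 is class 1
y = np.array([[0, 1], [1, 0]])   # both detections in the same cluster
print(deepercut_cost(alpha, beta, x, y))  # prints -6.0
```

Setting `y` to all zeros drops the pairwise reward and leaves only the unary part of the cost.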

I. Unary terms

• Body part detection candidates

• Capture distribution of scores over all part classes

II. Pairwise terms

• Capture part relationships within/across people

– proximity: same body part class (c = c′)

– kinematic relations: different part classes (c ≠ c′)

III. Integer Linear Program (ILP)

• Substitute $z_{dd'cc'} = x_{dc}\, x_{d'c'}\, y_{dd'}$ to linearize the objective

• NP-hard problem solved via branch-and-cut (1% optimality gap)

• Linear constraints on 0/1 labelings: plausible poses

– uniqueness

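$\forall d \in D: \; \sum_{c \in C} x_{dc} \le 1$

– consistency

$\forall dd' \in \binom{D}{2}: \; y_{dd'} \le \sum_{c \in C} x_{dc}$

$\forall dd' \in \binom{D}{2}: \; y_{dd'} \le \sum_{c \in C} x_{d'c}$

– transitivity

$\forall dd'd'' \in \binom{D}{3}: \; y_{dd'} + y_{d'd''} - 1 \le y_{dd''}$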
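
The three constraint families and the linearization step can be checked on small examples. The sketch below assumes a symmetric `y` matrix and uses the standard linear constraints for a product of binary variables (z ≤ each factor, and sum of factors − 2 ≤ z); whether the solver on the poster uses exactly this form is an assumption.

```python
import itertools
import numpy as np


def is_feasible(x, y):
    """Check uniqueness, consistency, and transitivity of a 0/1 labeling."""
    D = x.shape[0]
    active = x.sum(axis=1)                        # labels per detection
    if np.any(active > 1):                        # uniqueness
        return False
    for d, e in itertools.combinations(range(D), 2):
        if y[d, e] > min(active[d], active[e]):   # consistency
            return False
    for d, e, f in itertools.permutations(range(D), 3):
        if y[d, e] + y[e, f] - 1 > y[d, f]:       # transitivity
            return False
    return True


def linearization_is_exact():
    """Brute-force check that the linear constraints force z = a * b * c."""
    for a, b, c in itertools.product((0, 1), repeat=3):
        allowed = [z for z in (0, 1)
                   if z <= a and z <= b and z <= c and a + b + c - 2 <= z]
        if allowed != [a * b * c]:
            return False
    return True


x = np.array([[1, 0], [0, 1], [0, 0]])   # third detection is suppressed
y = np.zeros((3, 3), dtype=int)
y[0, 1] = y[1, 0] = 1                    # detections 0 and 1 share a cluster
print(is_feasible(x, y))                 # prints True

y[0, 2] = y[2, 0] = 1                    # link to an unlabeled detection
print(is_feasible(x, y))                 # prints False
print(linearization_is_exact())          # prints True
```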

Pairwise Terms

• image-conditioned pairwise terms using CNN regression

– train CNN to regress body part locations

– use regressed offsets and angles as features to train logistic regression to output pairwise probability
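
A minimal sketch of how regressed offsets could feed a logistic regression is given below. It is a simplification under stated assumptions: the features are computed in one direction only, the two features (distance and angle between regressed and actual offset) are one plausible reading of "offsets and angles", and the weights `w`, `b` and all coordinates are hypothetical.

```python
import numpy as np


def pairwise_features(loc_d, predicted_offset, loc_d2):
    """Distance and angle between the regressed and the actual offset."""
    actual_offset = loc_d2 - loc_d
    dist = np.linalg.norm(predicted_offset - actual_offset)
    # Angle difference wrapped to [-pi, pi), then taken as absolute value.
    diff = (np.arctan2(actual_offset[1], actual_offset[0])
            - np.arctan2(predicted_offset[1], predicted_offset[0]))
    angle = abs((diff + np.pi) % (2 * np.pi) - np.pi)
    return np.array([dist, angle])


def pairwise_probability(features, weights, bias):
    """Logistic regression: probability that two detections belong together."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))


# Hypothetical weights: larger distance or angle lowers the probability.
w, b = np.array([-0.1, -2.0]), 3.0
shoulder = np.array([100.0, 50.0])
offset_to_elbow = np.array([0.0, 40.0])   # stand-in for a CNN-regressed offset
good = pairwise_features(shoulder, offset_to_elbow, np.array([102.0, 88.0]))
bad = pairwise_features(shoulder, offset_to_elbow, np.array([180.0, 20.0]))
print(pairwise_probability(good, w, b) > pairwise_probability(bad, w, b))  # prints True
```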

[Figure: regression from left shoulder; regression from right knee. Pairwise vs. unary predictions for the right knee and for all parts: regression from all parts vs. unary only.]

Multi-stage optimization

• speed-up inference via incremental optimization

1. solve for head and shoulder locations

2. add elbows/wrists to stage 1 solution, re-optimize

3. add rest of body parts to stage 2 solution, re-optimize

Stage 1: head, shoulder

Stage 2: elbow, wrist

Stage 3: hip, knee, ankle
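
The staging above can be sketched as a loop over a growing set of part classes. The callback `solve(parts, previous)` is hypothetical and stands in for the actual optimizer; only the control flow is illustrated here.

```python
STAGES = [
    ["head", "shoulder"],
    ["elbow", "wrist"],
    ["hip", "knee", "ankle"],
]


def incremental_optimize(stages, solve):
    """Run `solve` once per stage on a growing set of part classes.

    `solve(parts, previous)` is a hypothetical callback that re-optimizes
    over `parts`, starting from the `previous` stage solution.
    """
    active, solution = [], None
    for new_parts in stages:
        active = active + new_parts
        solution = solve(active, solution)
    return solution


# Stand-in solver that only records which parts each stage optimizes over.
calls = []


def fake_solve(parts, previous):
    calls.append(list(parts))
    return {"parts": list(parts), "previous": previous}


result = incremental_optimize(STAGES, fake_solve)
print(len(calls), calls[-1])
```

Each stage sees all parts added so far, so the final call covers the full body while earlier calls work on much smaller problems.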

Quantitative Multi-Person Results

• MPII Multi-Person [1]

– Mean Average Precision (mAP) metric

Setting                 Head  Sho   Elb   Wri   Hip   Knee  Ank   mAP   s/frame
subset of 288 images
DeepCut [5]             73.4  71.8  57.9  39.9  56.7  44.0  32.0  54.1  57995
DeeperCut
  +image cond. pw.      83.1  75.8  64.6  54.0  60.6  52.0  44.9  62.6  2336
  +deeper archit.       83.3  79.4  66.1  57.9  63.5  60.5  49.9  66.2  1333
  +multi-st. opt.       87.5  82.8  70.2  61.6  66.0  60.6  56.5  69.7  230
Iqbal&Gall, ECCVw’16    70.0  65.2  56.4  46.1  52.7  47.9  44.5  54.7  10
full set
DeeperCut               79.1  72.2  59.7  50.0  56.0  51.0  44.6  59.4  485
  +heuristic solver     79.6  74.0  62.8  52.5  60.0  53.3  44.6  61.4  0.15
FR-CNN [6] + unary      64.9  62.9  53.4  44.1  50.7  43.1  35.2  51.0  1
Iqbal&Gall, ECCVw’16    58.4  53.9  44.5  35.0  42.2  36.7  31.1  43.1  10

• We are Family (WAF) [2]

– Percentage of Correct Parts (PCP) metric

Setting                   Head  U Arms  L Arms  Torso  mPCP  AOP   s/frame
DeepCut [5]               99.3  81.5    79.5    87.1   84.7  86.5  22000
DeeperCut                 99.3  83.8    81.9    87.1   86.3  88.1  13
Ghiasi et al., CVPR’14    -     -       -       -      63.6  74.0  -
Eichner&Ferrari, ECCV’10  97.6  68.2    48.1    86.1   69.4  80.0  -
Chen&Yuille, CVPR’15      98.5  77.2    71.3    88.5   80.7  84.9  -
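
The run-time columns above translate into speed-up factors with simple division. The short script below does that arithmetic on the s/frame values copied from the MPII Multi-Person and WAF tables; the helper name `speedup` is introduced here for illustration.

```python
# Seconds per frame copied from the MPII Multi-Person table (288-image subset).
runtimes_subset = {
    "DeepCut": 57995,
    "+image cond. pw.": 2336,
    "+deeper archit.": 1333,
    "+multi-st. opt.": 230,
}


def speedup(baseline, other):
    """How many times faster `other` is than `baseline`."""
    return baseline / other


for name, seconds in runtimes_subset.items():
    factor = speedup(runtimes_subset["DeepCut"], seconds)
    print(f"{name:18s} {factor:8.1f}x")

# MPII full set: DeeperCut (485 s/frame) vs. heuristic solver (0.15 s/frame).
print(f"{'heuristic solver':18s} {speedup(485, 0.15):8.1f}x")

# WAF: DeepCut (22000 s/frame) vs. DeeperCut (13 s/frame).
print(f"{'WAF DeeperCut':18s} {speedup(22000, 13):8.1f}x")
```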

Qualitative Multi-Person Results

• Successful cases

• Failure cases

– limbs across people

– symmetry confusion

– hard poses

Single Person Results

• Percentage of Correct Keypoints (PCK) metric

• MPII Single Person dataset [1]

Setting                 Head  Sho   Elb   Wri   Hip   Knee  Ank   PCKh  AUC
DeepCut [5] (unary)     94.1  90.2  83.4  77.3  82.6  75.7  68.6  82.4  56.5
DeeperCut (unary)       96.6  94.6  88.5  84.4  87.6  83.9  79.4  88.3  60.7
Newell et al., ECCV’16  98.2  96.3  91.2  87.1  90.1  87.4  83.6  90.9  62.9

• Leeds Sports Poses (LSP) [4]

Setting                 Head  Sho   Elb   Wri   Hip   Knee  Ank   PCK   AUC
DeepCut [5] (unary)     97.0  91.0  83.8  78.1  91.0  86.7  82.0  87.1  63.5
DeeperCut (unary)       97.4  92.7  87.5  84.4  91.5  89.9  87.2  90.1  66.1
Bulat&Tzimir., ECCV’16  97.2  92.1  88.1  85.2  92.2  91.4  88.7  90.7  63.4

• More comparisons at human-pose.mpi-inf.mpg.de
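
For readers unfamiliar with the PCK family of metrics used in the single-person tables, a minimal sketch follows. It assumes the common convention that a keypoint counts as correct when its distance to the ground truth is at most a fraction of a per-person reference length (head size for PCKh); the 0.5 threshold and the omission of visibility flags are assumptions, since the poster does not state them.

```python
import numpy as np


def pck(pred, gt, ref_length, threshold):
    """Fraction of keypoints within threshold * ref_length of ground truth.

    pred, gt: (N, K, 2) keypoint coordinates; ref_length: (N,) per-person
    normalization length (head size for PCKh).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)            # shape (N, K)
    correct = dist <= threshold * ref_length[:, None]
    return float(correct.mean())


gt = np.array([[[10.0, 10.0], [20.0, 20.0]]])
pred = np.array([[[11.0, 10.0], [30.0, 20.0]]])
head_size = np.array([8.0])
print(pck(pred, gt, head_size, threshold=0.5))  # prints 0.5
```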

References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR’14.

[2] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV’10.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv’15.

[4] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC’10.

[5] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR’16.

[6] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS’15.