
VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera

By Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, Christian Theobalt

Max Planck Institute for Informatics, Saarland University, Universidad Rey Juan Carlos

Presented by Asbjoern Fintland Lystrup and Marcus Loo Vergara

Goal Real-time markerless 3D pose estimation from a single RGB camera

Temporal stability

Invariant to background and body shape

Invariant to input image size

Excerpt starting at: https://youtu.be/W1ZNFfftx2E?t=0m5s

Previous Works

Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation

Deep learning methods represent the state of the art

Issues ◦ 2D pose estimation is not sufficient for certain tasks, e.g. virtual avatar control

◦ Typically assumes tight bounding boxes

Advantages ◦ Real-time

◦ High accuracy

RGB-D 3D pose estimation using RGB with depth (e.g. Microsoft Kinect)

Issues ◦ Does not work well outdoors due to sunlight interference

◦ Bulkier, more expensive, not as widely available, higher power consumption, limited resolution, field-of-view and range

Advantages ◦ Tracking of deformable objects

◦ Template-free reconstruction

[Figure: RGB image, depth map, and segmented depth from an RGB-D sensor]

Multi-view 3D pose estimation using multiple cameras

Issues ◦ Needs elaborate setup

◦ Offline computation

◦ Typically not real-time

Advantages ◦ Attains high accuracy

◦ Can reach real-time with approximations

Monocular 3D Pose Estimation Previous work in monocular 3D pose estimation uses deep learning

Issues ◦ Typically offline

◦ Often reconstructs 3D joint positions individually per image

◦ Temporally unstable when applied to sequences of images

◦ Does not enforce constant bone lengths

Method

Overview

Bounding Box Tracker Goal: Efficiently create a tight bound around the person

Want to avoid slow scale-space search for bounding box (BB)

Bounding Box Tracker First frames

◦ Scale-space search

Then ◦ Use the previous frame's keypoints to compute the smallest BB containing all keypoints

◦ Add 20% to the height and 40% to the width

◦ Shift BB horizontally to the centroid of the keypoints

◦ Corners of the BB are updated using a weighted average with the previous frame’s corners

◦ Finally, the BB crop is resized to 368×368 px (a sketch of this update heuristic follows below)
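
A minimal numpy sketch of this tracking heuristic, assuming keypoints as a (J, 2) array; the smoothing weight and the even split of the padding are assumptions, as the paper does not spell them out:

```python
import numpy as np

def update_bbox(keypoints_2d, prev_corners, smoothing=0.75):
    """One step of the BB tracking heuristic described above.

    keypoints_2d: (J, 2) array of 2D keypoints from the previous frame.
    prev_corners: (x_min, y_min, x_max, y_max) of the previous BB.
    smoothing: weight on the previous corners (assumed value).
    """
    x_min, y_min = keypoints_2d.min(axis=0)
    x_max, y_max = keypoints_2d.max(axis=0)

    # Enlarge: +20% height, +40% width (split evenly on both sides).
    h, w = y_max - y_min, x_max - x_min
    y_min, y_max = y_min - 0.1 * h, y_max + 0.1 * h
    x_min, x_max = x_min - 0.2 * w, x_max + 0.2 * w

    # Shift horizontally so the BB is centered on the keypoint centroid.
    cx = keypoints_2d[:, 0].mean()
    shift = cx - 0.5 * (x_min + x_max)
    x_min, x_max = x_min + shift, x_max + shift

    # Temporal smoothing: weighted average with the previous corners.
    corners = np.array([x_min, y_min, x_max, y_max])
    corners = smoothing * np.asarray(prev_corners) + (1 - smoothing) * corners
    return corners  # the crop at these corners is then resized to 368x368
```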

Excerpt starting at: https://youtu.be/W1ZNFfftx2E?t=1m24s

CNN Pose Regression Goal: Predict 2D and 3D joint positions

◦ 2D: Image space heatmap formulation

◦ 3D: Position relative to the root (pelvis)

CNN Pose Regression Approach

◦ Generate heatmap 𝐻𝑗 for each joint 𝑗

◦ Generate location-maps 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗 for each joint 𝑗

◦ Captures root-relative locations 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗

◦ 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗 are read from 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗 at the respective position in the heatmap 𝐻𝑗

CNN Pose Regression Modified ResNet50

◦ Remove the last 3 residual blocks

◦ Replace with the following architecture

CNN Pose Regression Custom residual block

CNN Pose Regression ∆𝑋𝑗 , ∆𝑌𝑗 , ∆𝑍𝑗: Parent-relative location-maps

𝐵𝐿: Bone-length maps ◦ $BL_j = \sqrt{\Delta X_j \odot \Delta X_j + \Delta Y_j \odot \Delta Y_j + \Delta Z_j \odot \Delta Z_j}$
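
For illustration, assuming the parent-relative location-maps are stored as (H, W) arrays per joint, the bone-length map is just the per-pixel Euclidean norm:

```python
import numpy as np

def bone_length_map(dX, dY, dZ):
    """Per-pixel bone length from parent-relative location-maps.

    dX, dY, dZ: (H, W) parent-relative location-maps for one joint.
    Returns the (H, W) bone-length map BL_j.
    """
    return np.sqrt(dX * dX + dY * dY + dZ * dZ)
```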

CNN Pose Regression Image features

CNN Pose Regression Concatenate intermediate features

◦ Idea: Parent-relative positions and bone lengths help guide the network

CNN Pose Regression Final heatmap 𝐻𝑗 and location-maps 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗
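
A hedged PyTorch sketch of this idea (channel counts, joint count, and layer names are illustrative assumptions, not the paper's exact architecture): intermediate parent-relative maps and their bone-length maps are concatenated with the image features before predicting the final heatmaps and location-maps.

```python
import torch
import torch.nn as nn

class LocationMapHead(nn.Module):
    """Sketch of the final prediction stage: intermediate parent-relative
    maps and bone-length maps are concatenated with image features to
    guide the final heatmap/location-map prediction."""

    def __init__(self, feat_ch=128, joints=21):
        super().__init__()
        # Intermediate prediction: dX, dY, dZ per joint (3 * J channels).
        self.intermediate = nn.Conv2d(feat_ch, 3 * joints, 1)
        # Final prediction: heatmap H plus X, Y, Z per joint (4 * J channels),
        # computed from image features + intermediate maps + BL maps.
        self.final = nn.Conv2d(feat_ch + 3 * joints + joints, 4 * joints, 1)

    def forward(self, features):
        delta = self.intermediate(features)              # (B, 3J, H, W)
        dX, dY, dZ = delta.chunk(3, dim=1)               # (B, J, H, W) each
        BL = torch.sqrt(dX**2 + dY**2 + dZ**2 + 1e-8)    # bone-length maps
        x = torch.cat([features, delta, BL], dim=1)
        return self.final(x)                             # H, X, Y, Z maps
```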

CNN Pose Regression Now we can extract 2D keypoints and 3D joint positions

◦ Keypoint position 𝐾𝑗 is given by the location of the maximum of heatmap 𝐻𝑗

◦ 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗 are read from 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗 at their respective 𝐾𝑗
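
A minimal numpy sketch of this read-out (array shapes are assumptions):

```python
import numpy as np

def read_out_pose(heatmaps, Xm, Ym, Zm):
    """Extract 2D keypoints and root-relative 3D positions.

    heatmaps, Xm, Ym, Zm: (J, H, W) arrays of per-joint heatmaps and
    location-maps. Returns (J, 2) keypoints (row, col) and (J, 3)
    root-relative 3D joint positions.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat, (H, W))   # K_j: heatmap maxima
    joints_2d = np.stack([rows, cols], axis=1)
    joints_3d = np.stack([Xm[np.arange(J), rows, cols],
                          Ym[np.arange(J), rows, cols],
                          Zm[np.arange(J), rows, cols]], axis=1)
    return joints_2d, joints_3d
```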

Training Loss term

$\mathrm{Loss}(x_j) = \left\| H_j^{GT} \odot \left( X_j - X_j^{GT} \right) \right\|_2$

where $\odot$ is element-wise (Hadamard) multiplication

This enforces that we are only interested in 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗 at the joint's 2D location in 𝐻𝑗 ◦ That is, the loss is weighted most strongly around the joint's 2D location
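
A minimal numpy sketch of this weighted loss for one joint's location-map (shapes assumed (H, W)):

```python
import numpy as np

def location_map_loss(H_gt, X_pred, X_gt):
    """Heatmap-weighted L2 loss for one joint's location-map.

    H_gt: (H, W) ground-truth heatmap (a Gaussian around the joint's
          2D location), used as a per-pixel weight.
    X_pred, X_gt: (H, W) predicted and ground-truth location-maps.
    """
    return np.linalg.norm(H_gt * (X_pred - X_gt))
```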

Temporal Filtering 2D keypoints and 3D positions are filtered with the 1 Euro filter [Casiez et al. 2012] ◦ An adaptive low-pass filter that smooths heavily at low speeds (less jitter) and lightly at high speeds (less lag)
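
For reference, a compact implementation of the 1 Euro filter, applied independently per coordinate (parameter values are illustrative defaults, not the paper's):

```python
import math

class OneEuroFilter:
    """1 Euro filter [Casiez et al. 2012]: exponential smoothing whose
    cutoff frequency adapts to the signal's speed."""

    def __init__(self, min_cutoff=1.0, beta=0.007, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # Hz: smoothing at low speeds
        self.beta = beta              # speed coefficient: lag reduction
        self.d_cutoff = d_cutoff      # Hz: cutoff for the derivative
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff, dt):
        r = 2.0 * math.pi * cutoff * dt
        return r / (r + 1.0)

    def __call__(self, x, dt):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Filter the derivative with a fixed cutoff.
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Adapt the cutoff to the filtered speed, then filter the signal.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```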

Kinematic Skeleton Fitting 1. Retarget skeleton to the underlying model

2. Fit final skeleton using the Levenberg-Marquardt algorithm

Kinematic Skeleton Fitting Final skeleton $P^G_t = P^G_t(\theta, d)$, parameterized by $\theta$ and $d$

◦ 𝜃: Vector of joint angles

◦ 𝑑: Root joint’s location in camera space

Non-linear optimization problem

Kinematic Skeleton Fitting Objective energy

$E_{total}(\theta, d) = E_{IK}(\theta, d) + E_{proj}(\theta, d) + E_{smooth}(\theta, d) + E_{depth}(\theta, d)$

Idea: Fit a skeleton which minimizes this energy ◦ That is, solve the minimization problem

$\arg\min_{\theta, d} E_{total}(\theta, d)$

Kinematic Skeleton Fitting The inverse kinematics term

$E_{IK}(\theta, d) = \left\| (P^G_t - d) - P^L_t \right\|_2$

◦ $d$: Skeleton root position

◦ $P^G_t$: Final (global) 3D pose

◦ $P^L_t$: Predicted root-relative 3D pose

Kinematic Skeleton Fitting The projection term

$E_{proj}(\theta, d) = \left\| \Pi(P^G_t) - K_t \right\|_2$

◦ $\Pi(\cdot)$: Projection function from 3D to the 2D image plane

◦ $P^G_t$: Final 3D pose

◦ $K_t$: Predicted 2D keypoints

Kinematic Skeleton Fitting The smoothing term

$E_{smooth}(\theta, d) = \left\| \ddot{P}^G_t \right\|_2$

◦ $\ddot{P}^G_t$: Acceleration of $P^G_t$

Kinematic Skeleton Fitting The depth term

$E_{depth}(\theta, d) = \left\| [\dot{P}^G_t]_z \right\|_2$

◦ $\dot{P}^G_t$: Velocity of $P^G_t$

◦ $[\dot{P}^G_t]_z$: Z-component of the 3D velocity, penalized because depth is the least-constrained direction from a single camera

Kinematic Skeleton Fitting Apply the Levenberg–Marquardt algorithm, also known as damped least-squares (DLS), to obtain the final pose $P^G_t$:

$\arg\min_{\theta, d} E_{total}(\theta, d)$
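
A hedged sketch of how such a fit could be set up with SciPy's Levenberg-Marquardt solver. The forward-kinematics and projection functions, the finite-difference derivatives, and the (omitted) per-term weights are placeholders; the paper uses its own solver and weighting.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_skeleton(theta0, d0, P_L, K_t, forward_kinematics, project,
                 P_prev, P_prev2, dt):
    """Stack the residuals of E_IK, E_proj, E_smooth and E_depth and
    hand them to a Levenberg-Marquardt solver.

    forward_kinematics(theta) -> (J, 3) root-relative joint positions,
    project(P) -> (J, 2) image-plane projection: placeholder callables.
    """
    def residuals(params):
        theta, d = params[:-3], params[-3:]
        P_G = forward_kinematics(theta) + d           # global 3D pose
        accel = (P_G - 2 * P_prev + P_prev2) / dt**2  # finite differences
        vel_z = (P_G - P_prev)[:, 2] / dt
        return np.concatenate([
            ((P_G - d) - P_L).ravel(),     # E_IK
            (project(P_G) - K_t).ravel(),  # E_proj
            accel.ravel(),                 # E_smooth
            vel_z.ravel(),                 # E_depth
        ])

    x0 = np.concatenate([theta0, d0])
    return least_squares(residuals, x0, method='lm')
```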

Training

About Training The CNN regressor is the only part that needs training

How to train network to predict keypoints and location-maps?

About Training The network is pretrained for 2D pose estimation on MPII and LSP

◦ MPII ◦ 25K images containing over 40K people with annotated body joints

◦ Wide range of activities

◦ LSP ◦ 2K images of sports activities

The paper does not go into detail about how this pretraining is done

About Training Train for 3D pose estimation using MPI-INF-3DHP and Human3.6m

◦ MPI-INF-3DHP ◦ Generated using a multi-view markerless motion capture system

◦ From all 14 cameras there are ~1.3M frames

◦ Captured on a greenscreen with background augmentation

◦ Use data from 5 chest-high cameras, 2 head-high cameras and 1 knee-high camera

◦ Frames are sampled so that at least one joint moves by more than 200 mm between consecutive selected frames (see the sketch after this list)

◦ Human3.6m ◦ 3.6 million 3D human poses and corresponding images

◦ 11 professional actors

◦ 17 scenarios
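
A minimal sketch of that sampling criterion; the greedy selection strategy and array layout are assumptions, the paper states only the 200 mm threshold:

```python
import numpy as np

def sample_frames(poses, threshold_mm=200.0):
    """Greedy frame selection: keep a frame only if at least one joint
    has moved more than threshold_mm since the last kept frame.

    poses: (T, J, 3) array of 3D joint positions in millimetres.
    Returns the indices of the selected frames.
    """
    kept = [0]
    for t in range(1, len(poses)):
        displacement = np.linalg.norm(poses[t] - poses[kept[-1]], axis=1)
        if displacement.max() > threshold_mm:
            kept.append(t)
    return kept
```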

Results

https://youtu.be/W1ZNFfftx2E

Comparisons Comparison to the state of the art on the MPI-INF-3DHP test set using ground-truth bounding boxes: ◦ PCK: Percentage of Correct Keypoints (3D joint positions within a fixed distance of ground truth)

◦ AUC: Area Under the Curve

◦ MPJPE: Mean Per Joint Position Error (mm)
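
For concreteness, a sketch of how MPJPE and 3D PCK are typically computed (the 150 mm PCK threshold is the one commonly used with MPI-INF-3DHP, an assumption here rather than a detail from these slides):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error in mm.

    pred, gt: (T, J, 3) arrays of 3D joint positions in millimetres.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck_3d(pred, gt, threshold_mm=150.0):
    """Percentage of Correct Keypoints: fraction of joints whose 3D
    error is below the threshold."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return (errors < threshold_mm).mean()
```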

Comparisons Overall better pose quality

◦ Particularly for end effectors

◦ Occasional large mispredictions

Comparisons Using 3D pose vastly improves PCK

Additional improvement from filtering and combining 2D and 3D constraints

Limitations Self-occlusion

Poses far from the training data are hard

Multiple people ◦ Lack of training data

Occluded faces

Fast motion

High-end hardware required for real-time performance

Related Work DensePose [Güler et al. 2018]

◦ Surface-based representation of human pose

◦ Uses a region-based CNN with ROI-Align pooling to obtain part labels

◦ Multi-person

https://youtu.be/Dhkd_bAwwMc

Summary Real-time 3D pose estimation from a single RGB camera

Better than offline state-of-the-art solutions in some categories

Temporal filtering and skeleton fitting improve quality

Limited availability of annotated datasets
