

Multi-Task and Multi-Level Detection Neural Network Based Real-Time 3D Pose Estimation

Dingli Luo∗†, Songlin Du∗ and Takeshi Ikenaga∗
∗ Graduate School of Information, Production and Systems, Waseda University, Kitakyushu 808-0135, Japan
† University of Electronic Science and Technology of China, Chengdu 611731, China
[email protected]

Abstract—3D pose estimation is a core step for human-computer interaction and human action recognition. However, time-sensitive applications such as virtual reality also need this task to run in real time. This paper proposes a multi-task and multi-level neural network architecture together with a 3D human pose representation designed for high speed. Based on this, we build a real-time multi-person 3D pose estimation system that takes a single RGB image as input. The network estimates 3D poses directly from the input image through the multi-task design, and maintains both accuracy and speed through the multi-level detection design. In our evaluation, the system achieves 21 fps on an RTX 2080 with only 33 mm of accuracy loss compared with related works. We also provide network visualizations to verify that the network behaves as designed. This work demonstrates that a 3D pose estimation system based on a single RGB image can achieve real-time speed, which is a foundation for building a low-cost 3D motion capture system.

I. Introduction

Human pose estimation is an important step toward computer understanding of human behavior, because abstract 2D or 3D human poses are much easier to analyze than raw images or video. Some works [1], [2] estimate the human 2D pose by marking the 2D position of each joint of each person in a given input image. Furthermore, estimating the human pose as joint 3D positions in 3D space is a more difficult task, but poses in 3-dimensional space contain more information and are not affected by camera projection, so many works address this task as well. Recent approaches estimate the human 3D pose in four ways: from a single 2D image [3], from multiple calibrated images [2], from video sequences [4], or on top of another 2D pose estimation system [5]. Image- and video-based methods are more complex and harder to design, but compared with methods built on a 2D estimation system, they avoid the time cost and accuracy loss introduced by that extra system. Systems using more than one camera generally provide more accurate results because of less occlusion, but they are limited by cost considerations and the usage environment.

Our approach needs only one monocular RGB camera without depth, and it targets game and entertainment usage like the Kinect from Microsoft. This means we aim to provide a real-time multi-person 3D pose estimation system with a single RGB image as input. Although this system has a limited detection range and, like the Kinect, does not perform well in complicated situations, it only needs a regular RGB camera, which makes it much cheaper for consumers.

To achieve this goal, a new multi-task and multi-level pose estimation neural network and its corresponding representation are proposed. The multi-task network takes an RGB image as input and produces 2D joint positions, depth information, and connection information in the same pass, while the multi-level structure avoids repeated processing and keeps the accuracy. A post-processing system then generates all human poses from this information. We propose a joint representation that acts as a bridge between the neural network and the traditional algorithm and is easier for both learning and post-processing. In order to train this network despite the lack of in-the-wild 3D datasets, we provide a mixed training design with a real-life 2D dataset and a virtual 3D dataset captured from a game, and we validate that a network trained on a virtual dataset is effective for real-life tasks. We also provide a light-weight network design for real-time speed, and our post-processing design recovers the accuracy this costs.

Our system is general enough for use indoors and in the wild. We trained it on the MSCOCO 2014 [6] and JTA [7] datasets, and it achieves 112 mm MPJPE at 21 FPS on an RTX 2080. We also provide a detailed profile that explains the behavior of our neural network, which is important for industrial usage.

II. Related Work

A. 2D Real-Time Human Pose Estimation

Although we design a 3D human pose estimation system, we borrow many ideas from current 2D human pose estimation systems, such as the encoding method and the network design.

A 2D human pose estimation system takes an image or video as input and outputs, for each person, the 2D positions of that person's joints. Early systems were based on hand-crafted image features, such as [8]. Nowadays, deep learning based methods such as


DeepPose [9] have been widely adopted to achieve higher accuracy. Some of these methods also aim at real-time speed; to achieve both high speed and high accuracy, researchers work in two directions.

a) Bottom-up solutions: This kind of solution detects each joint first and then connects the corresponding joints into individual humans; such methods are called "bottom-up" solutions. Most works treat joint position detection as predicting a per-pixel probability; the difficult problem is deciding which joints belong to the same person. For example, OpenPose [2] from CMU uses part affinity fields to guide the connection. Another work [10] uses the offset to the parent joint to solve this, which inspired the design of our representation. Some researchers [11] use clustering to merge all joints belonging to the same person. In addition, by treating the human pose as a graphical model, deep graph neural networks [12] are used to analyze the possible relationships between joints and determine their assignment [13]. Bottom-up solutions naturally handle multiple persons in an image and can achieve stable high speed.

b) Top-down solutions: This kind of solution first detects each human in the input with an object detection method such as Mask R-CNN [14], crops each person into an image patch, and then performs pose estimation for each human individually. Such methods are called "top-down" solutions. Since each patch contains only one person, connecting the joints is easier, so these methods are used when high accuracy is required. However, the speed depends on the object detection system and is also affected by the number of humans. AlphaPose [15], [16] is a well-known top-down solution.

B. 3D Real-Time Human Pose Estimation

Compared with image-space 2D human pose estimation, 3D human pose estimation is a much harder problem. In industry, marker-based and multi-camera-calibration-based 3D human pose systems [17] are widely used for motion capture by CG companies, but the requirements on clothes, markers, space, and camera number make them hard to use as consumer-level solutions. Many works try to provide solutions with lower cost, fewer cameras, and faster speed. Depending on the input, they can be divided into four kinds.

a) Single 2D image based: These methods need only one camera, which means low cost and easy deployment. However, estimating 3D joint positions from 2D image space is an under-determined problem because there is more than one valid answer. Using a binocular camera [18] or a depth camera [19] to obtain depth information is one solution. On the other hand, by exploiting human body structure, even a single RGB camera can recover the most probable human 3D pose. For example, the single-shot multi-person 3D pose estimation system [3] provides an end-to-end neural network that outputs 3D joint positions. Other works [20] fit 3D geometry based on parametric human 3D models [21]. All these works show the possibility of estimating the 3D human pose from monocular RGB input.

b) Camera calibration based: Camera calibration [22] computes 3D positions from images captured from different directions and is widely used in SLAM systems [23]. Since this approach makes it easy to extend an existing 2D human pose estimation system to 3D, many 2D estimation works [2], [16] choose it to provide a 3D pose estimation solution. However, it needs more than one camera at more than one angle, which is hard to deploy in the wild.

c) Video sequence based: Motion also provides depth information, because apparent motion speed varies with the distance to the camera. Moreover, an action sequence contains more information for a neural network to understand human body structure, so many works [24] choose video as input. Some of them need another 2D estimation network to convert the video into a 2D motion sequence, which is then used to estimate the 3D pose; VideoPose3D [4] from Facebook AI is a recent example. These methods achieve higher accuracy than image-based methods, but the pre-processing time and the latency from collecting input frames make real-time speed hard to reach.

d) 2D pose based: To avoid the complexity of color images, many works take as input 2D poses produced by another 2D human pose estimation method. For example, the 3D pose baseline [5] takes one human's 2D pose directly and outputs the 3D pose as joint 3D positions; both the input and the output are vectors instead of images. Since pose vector samples are easy to construct automatically, self-supervised [25] and unsupervised [26] methods are also widely used to improve the accuracy. However, converting from the image to an abstract 2D pose loses the color information, which makes it impossible to benefit from color cues when refining the 3D position.

III. Multi-Task and Multi-Level Network for Real-Time Multi-Person 3D Pose Estimation

We design a simple pipeline with one RGB image as input and 3D poses as output. As shown in Fig. 1, the pipeline contains only two steps: a neural network processes the input RGB image and outputs 2D joint positions, connection information, and joint depth information at the same time; a post-processing system then combines everything to generate the 3D pose result. Compared with a typical 3D pose estimation system such as the 3D pose baseline [5], our work uses a specially designed intermediate representation to avoid losing depth information. This design lets our system finish both 2D and 3D detection in one network, which makes it faster.


Fig. 1. Concept difference between ours and a typical solution. Ours: input image → CNN (one network) → 2D joint probability, 2D offset probability, and joint depth → post-processing → final 3D pose. Typical: input image → CNN (2D pose network) → 2D pose → NN (3D pose network) → final 3D pose.

In this section, we first describe the pose representation used in our system, which determines both the neural network and the post-processing design. Then we describe each step of the pipeline: a multi-task neural network outputs 2D pose information and 3D information at the same time, and a multi-level detection structure keeps both speed and accuracy. The second step, post-processing, handles joint detection, linking, and the automatic matching between the pose in 2D image space and the depth in 3D world space.

A. Relative Depth Based Pose Representation

The pose representation acts as a bridge between the neural network and the post-processing system, so its definition strongly affects both designs. In our case we use a CNN as the basic module, which means the outputs are images, so we define an image-based representation for all human poses.

There are many kinds of representation. Some works [5] directly represent each 3D joint position in a unified 3D space, where "unified" means the positions are not related to the camera position. Other researchers treat the pose as a set of parameters of a parametric body model such as SMPL [21].

In our case, we consider a human pose as a directed graph. Each node corresponds to a human joint j and carries the tuple (Class_j, X_j, Y_j, D_j, OffsetX_j, OffsetY_j), as shown in Fig. 2. Class_j is the joint's class; (X_j, Y_j) is the 2D position of joint j; D_j is the joint depth, i.e., its distance to the camera; and (OffsetX_j, OffsetY_j) is the offset pointing to the position of the joint's parent. To simplify the network's task and speed it up, we encode each 3D joint position in two parts: the 2D position (X_j, Y_j) in image space and a relative depth D in world space.

Fig. 2. The encoding method of our pose: each joint J carries its 2D position (X, Y), an offset pointing to its parent joint, and a relative depth with respect to that parent.

The 2D positions can be reused from the pose linking process, so the one-channel depth D is the only additional information needed. Based on the network output, the post-processing applies a mapping F into the final camera-related 3D space:

F(X_j, Y_j, D) \rightarrow (X_{3D}, Y_{3D}, Z_{3D}). \quad (1)

For the depth, we encode the value as a relative depth. We treat the human pose as a tree with the head as the root node: for each joint, the connected joint nearer to the head is its parent, and the farther joints are its children. The relative depth is then the current joint's depth relative to its parent joint's depth. We make sure each joint has only one parent; otherwise the relative depth would not be unique. Compared with the absolute depth, the relative depth is much easier to learn: it does not depend on the person's position and only needs local information.
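To make this decoding concrete, here is a minimal sketch of recovering absolute depths from the relative encoding. The `PARENT` table is a hypothetical joint hierarchy, and we assume the root (head) depth is supplied separately, e.g. by the depth-adapting step of Section IV:

```python
# Sketch: decode absolute joint depths from the relative-depth encoding.
# PARENT is a hypothetical hierarchy (joint index -> parent index);
# the paper's actual skeleton definition may differ.
PARENT = {0: None, 1: 0, 2: 1, 3: 2}  # e.g. head -> neck -> hip -> knee

def decode_absolute_depths(relative_depth, root_depth):
    """relative_depth[j] = depth(j) - depth(parent(j)); the root joint
    has no parent, so its depth must be given as root_depth."""
    absolute = {}
    def resolve(j):
        if j not in absolute:
            if PARENT[j] is None:                  # head = tree root
                absolute[j] = root_depth
            else:                                  # child = parent + offset
                absolute[j] = resolve(PARENT[j]) + relative_depth[j]
        return absolute[j]
    for j in PARENT:
        resolve(j)
    return absolute

# Each joint 0.1 unit farther from the camera than its parent:
print(decode_absolute_depths({1: 0.1, 2: 0.1, 3: 0.1}, root_depth=5.0))
# {0: 5.0, 1: 5.1, 2: 5.2, 3: 5.3}
```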

B. Multi-Task and Multi-Level 3D Detection Network

The design target of our network is to keep accuracy while achieving high speed. Based on this, we provide a new multi-task and multi-level 3D detection network architecture. As shown in Fig. 3, it contains three parts: a feature pyramid network, a 2D detection branch, and a depth detection branch. The depth detection branch is almost identical to the 2D detection branch except for its map generation module. Each detection branch performs detection at several feature levels and concatenates the results; a map generation module then analyzes the concatenated result and outputs the final map.

1) Multi-Level Detection Network Backbone: A convolutional neural network usually has a best detection target size, so the usual way to adapt to different target sizes is to process the input image several times at different scales. In our case, however, processing the input image more than once costs too much time. Instead, we use a feature pyramid network [27] based multi-level architecture, which detects at multiple scales through the network structure itself and adapts to different target sizes without raising the calculation cost too much.


Fig. 3. The overall map of our network architecture. A ResNet34-like backbone first processes the input image, followed by deconvolution and concatenation layers; together these build a U-Net structure. Two separate branches follow: one for 2D detection, outputting the 2D probability map and the 2D offset map, and one for the 3D depth map. Each detection branch is constructed from a multi-level detection module, a concatenation layer, and a map generation module. For the detailed structure, see Fig. 4.

This saving causes no accuracy loss, because the skipped calculation is redundant: for each area of the input image there is only one best detection level, so the calculation at the other detection levels is not needed.

In detail, we use ResNet34 [28] as the front part but compress the channels to one half; the detailed structure is shown in Table I. It is followed by an additional convolution layer and then three pairs of deconvolution and concatenation layers.

2) Detection Module and Map Generation Module: We design a common structure, the detection module, used at the different levels, and a small map generation module that produces the target map; both are shown in Fig. 4. Take the map generation module as an example: a convolution layer first decreases the channel number of the input features, and a residual structure then extracts high-quality features. Finally, since feature maps from different levels must have the same size before they can be merged, deconvolution (transposed convolution) layers are applied; each deconvolution halves the channel count. Compared with using pooling layers, this saves calculation.

All these modules follow the same idea as ResNet [28]: a 1×1 convolution first shrinks the channel number, a 3×3 convolution processes the reduced features, and a final 1×1 convolution recovers the original channel number. The cost is lower than directly applying a 3×3 convolution at the full channel count.
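As an illustration, a minimal PyTorch sketch of this 1×1 → 3×3 → 1×1 residual structure could look as follows (the layer grouping and the leaky-ReLU slope are our assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """A 1x1 conv shrinks the channels to a half, a 3x3 conv processes the
    reduced features, and a 1x1 conv recovers the original channel count;
    a residual addition closes the block."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        def unit(cin, cout, k):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=k // 2),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.1),
            )
        self.body = nn.Sequential(
            unit(channels, mid, 1),   # shrink channels
            unit(mid, mid, 3),        # cheap 3x3 on fewer channels
            unit(mid, channels, 1),   # recover channel count
        )

    def forward(self, x):
        return x + self.body(x)      # residual connection

x = torch.randn(1, 64, 48, 48)
print(Bottleneck(64)(x).shape)       # torch.Size([1, 64, 48, 48])
```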

3) Loss Design: As a multi-task neural network, we must design the loss function carefully to balance the different branches; otherwise one branch may end up weaker than the others.

In our case, for the classification branch we directly use an L2 loss instead of a softmaxed cross-entropy loss. For the other two branches, we only supervise the positions where a joint exists: for the background area we simply ignore the loss and allow the branches to output any result. We found that this makes the task easier and lets the network reach a lower loss.
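A minimal sketch of such a masked loss, assuming the ground-truth joint heat map doubles as the foreground mask (the exact masking rule is our assumption):

```python
import torch

def masked_l2_loss(pred, target, joint_mask):
    """L2 loss evaluated only where a joint exists; the background is
    ignored, so the branch may output anything there."""
    sq_err = (pred - target) ** 2
    masked = sq_err * joint_mask                  # zero out background pixels
    count = (joint_mask.sum() * pred.shape[1]).clamp(min=1)
    return masked.sum() / count

# Offset and depth branches use the mask; classification uses plain L2.
pred = torch.randn(1, 2, 48, 48)
target = torch.randn(1, 2, 48, 48)
mask = (torch.rand(1, 1, 48, 48) > 0.95).float()  # sparse joint locations
print(masked_l2_loss(pred, target, mask))
```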

IV. Post-Processing

The output of the network is three multi-channel maps, and post-processing is needed to obtain the final 3D pose.


Our post-processing consists of four steps:
1) Joint detection: a convolution pass finds the local maximum points in the output heat map (see the sketch after this list).
2) Joint linking: a set of joint pairs is generated from the heat map and the relative map; the pairs are sorted and the joint connections with the highest total probability are selected.
3) Depth adapting: the relative scale of the world-space depth is calculated, bringing the mismatched coordinate spaces into the same coordinate space.
4) Temporal filter: depending on the application, an optional temporal filter smooths the 3D pose result.
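A sketch of step 1, assuming the local maxima are found by comparing the heat map against a max-pooled copy of itself (the threshold value is our assumption):

```python
import torch
import torch.nn.functional as F

def detect_joints(heat_map, threshold=0.3):
    """A pixel is a joint candidate if it equals the maximum of its 3x3
    neighbourhood and exceeds the (assumed) confidence threshold."""
    pooled = F.max_pool2d(heat_map, kernel_size=3, stride=1, padding=1)
    peaks = (heat_map == pooled) & (heat_map > threshold)
    return peaks.nonzero()  # rows of (batch, joint class, y, x)

heat = torch.rand(1, 14, 48, 48)  # 14 joint classes is an assumption
print(detect_joints(heat).shape)
```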

A. Distance Based Joint Linking

In order to link the joints together, we first calculate the target position:

(TargetX_j, TargetY_j) = (X_j + OffsetX_j, Y_j + OffsetY_j).

We then search a circular area centered at (TargetX_j, TargetY_j) with radius R_search. If a joint of the right class lies inside the circle, we link the current joint to it. The search radius depends on the target distance, because the farther the parent joint, the higher the expected error:

l = \sqrt{OffsetX_j^2 + OffsetY_j^2}, \quad (2)

R_{search} = \max(l \cdot \alpha + (1 - \alpha), 1). \quad (3)

In practice we set α to 0.3; we found this gives higher accuracy than a fixed search radius.
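As a worked example, a direct implementation of Eqs. (2) and (3) with α = 0.3:

```python
import math

def search_radius(offset_x, offset_y, alpha=0.3):
    """Eqs. (2)-(3): the radius grows with the parent offset length,
    since farther parents have larger localization error."""
    l = math.hypot(offset_x, offset_y)          # Eq. (2)
    return max(l * alpha + (1 - alpha), 1.0)    # Eq. (3)

print(search_radius(3.0, 4.0))  # l = 5.0 -> 5.0*0.3 + 0.7 = 2.2
print(search_radius(0.5, 0.0))  # l = 0.5 -> 0.85, clamped to 1.0
```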

TABLE I
ResNet-like Backbone Structure

layer name   output size   34-layer
input        384 × 384     -
conv1        192 × 192     7 × 7, 8, stride 2
conv2_x      96 × 96       3 × 3 max pool, stride 2; [3 × 3, 32; 3 × 3, 32] × 3
conv3_x      48 × 48       [3 × 3, 64; 3 × 3, 64] × 4
conv4_x      24 × 24       [3 × 3, 128; 3 × 3, 128] × 6
conv5_x      12 × 12       [3 × 3, 256; 3 × 3, 256] × 3

Fig. 4. Detection module and map generation module structure. Both modules are constructed from the same residual structure: a Conv2D 1×1 decreases the channels to a half, a Conv2D 3×3 performs the convolution, and a Conv2D 1×1 recovers the original channel number, each followed by batch normalization and leaky ReLU. The detection module uses this basic structure twice; the map generation module uses it once. To concatenate with the outputs of other levels, a set of deconvolution layers makes the output feature maps the same width and height.

B. Regression Based Automatic Depth Scale

A core part is how to adapt the depth scale based on the 2D detection result. We want to scale the depth by a value Scale_depth so that all coordinates share the same unit. In theory, the depth is measured from the camera position (Eq. (4)), and the 2D coordinates of a joint also require the camera parameters (Eq. (5)); without camera information we therefore cannot map the world-space depth into camera space:

depth = \sqrt{(P_x - C_x)^2 + (P_y - C_y)^2 + (P_z - C_z)^2}, \quad (4)

\begin{bmatrix} u_x \\ u_y \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}. \quad (5)


However, we found a mapping between the human's 2D size and the depth scale, under the assumption that the human is of normal size. The relationship between a person's 2D size (Width, Height) and Scale_depth is strong, as verified by both linear regression and visualization.

We then use a fast calculation to obtain Scale_depth and multiply it with the depth. This yields the correct human joint positions in camera space without any camera information.
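A minimal sketch of this idea, assuming a linear model fitted offline on (Width, Height) → Scale_depth pairs; the coefficients below are placeholders, not the paper's fitted values:

```python
import numpy as np

# Placeholder coefficients standing in for an offline linear regression on
# (width, height) -> depth-scale pairs; the paper's fitted values differ.
W_COEF, H_COEF, BIAS = 0.004, 0.008, 0.1

def depth_scale(bbox_width, bbox_height):
    """Fast linear mapping from a person's 2D size to the depth scale."""
    return W_COEF * bbox_width + H_COEF * bbox_height + BIAS

def to_camera_space(x2d, y2d, depths, bbox_width, bbox_height):
    """Scale the network's depths into the same unit as the 2D coordinates."""
    z = np.asarray(depths) * depth_scale(bbox_width, bbox_height)
    return np.stack([np.asarray(x2d), np.asarray(y2d), z], axis=-1)

print(to_camera_space([10.0, 12.0], [20.0, 30.0], [4.8, 5.0],
                      bbox_width=80, bbox_height=180))
```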

V. Experiments

A. Training

A major problem when training a 3D pose estimation neural network is the dataset. Capturing high-quality 3D joint data usually has to be done in a studio, with subjects wearing special clothes with markers, so it is tough to obtain a 3D pose dataset in the wild. However, it is much easier to obtain 3D poses from a game, because they are needed to render the characters anyway. We therefore use a virtual 3D pose dataset, the JTA dataset [7]. Although its input images are less realistic than real-life datasets, we show that with carefully designed training and data augmentation, a game dataset is also suitable for training a neural network that will be used in real life.

We first train the 2D detection branch on the MSCOCO 2014 dataset [6] to initialize the network. Then, at each step, we randomly choose a sample from either the 2D dataset or the 3D dataset. We found that this training schedule makes the training more stable and decreases the training time. We trained the whole network on one Nvidia GTX 1080 Ti for 72 hours.
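A minimal sketch of this mixed sampling schedule; the 50/50 ratio and the loader interface are our assumptions:

```python
import itertools
import random

def mixed_batches(loader_2d, loader_3d, steps, p_2d=0.5):
    """Yield (batch, has_3d_labels); the depth loss is skipped for 2D-only
    batches, which supervise only the 2D detection branch."""
    it2d, it3d = itertools.cycle(loader_2d), itertools.cycle(loader_3d)
    for _ in range(steps):
        if random.random() < p_2d:
            yield next(it2d), False   # MSCOCO sample: 2D labels only
        else:
            yield next(it3d), True    # JTA sample: full 3D supervision

for batch, has_3d in mixed_batches(["coco_a", "coco_b"], ["jta_a"], steps=4):
    print(batch, has_3d)
```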

B. Results

We evaluated our method on the JTA dataset test set, using MPJPE (mean per-joint position error) to measure performance; the results are shown in Table II. Since we want to test image-based performance, we do not use the temporal smoother. Because few image-based 3D pose solutions also target speed, we compare against a small set of other solutions. The other RGB-image-based solutions do not target speed and cannot achieve real time. Some methods take a 2D pose as input, which means they need pre-processing by another 2D pose network, and the quality and speed of that pre-processing strongly affect their performance. We test these methods by prepending a typical 2D pose estimation system, OpenPose [2], to measure their speed and accuracy.

However, many cases in the test set do not match our design target, such as crowded scenes with people at very different scales. We therefore also test our work on a test subset closer to game and entertainment scenarios. Result images are shown in Fig. 5.

By analysis, we find that most of the error occurs at the feet. This is consistent with our relative depth encoding: the farther a joint is from the root node, the higher its accumulated error.

C. Network Profile and Visualization

We designed this network architecture around the idea of multi-level detection. However, it is hard to prove that the neural network behaves as designed, because there is no manually written code or logic inside it. We therefore use a network profile and visualization.

To show that different levels focus on different areas, we dump the output feature maps of each level and find the areas with the highest total output:

MapActivation(x, y) = \frac{\sum_i MapFeature(x, y, i)}{\max_i MapFeature(x, y, i)}. \quad (6)

Higher activation means the current level has higher interest in that area. As shown in Fig. 6, the different detection levels behave as intended: the high level, with its small detection receptive field, focuses on small targets such as necks and heads of people far from the camera; the low level, with its large detection receptive field, focuses on large body parts near the camera; and the middle level covers the remaining areas. The areas of interest at different levels do not overlap, which confirms that the network acts as designed.
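Eq. (6) can be computed directly from a dumped feature map, for example (a minimal sketch):

```python
import torch

def activation_map(feature):
    """Eq. (6): per-pixel channel sum normalized by the per-pixel channel
    maximum; `feature` has shape (C, H, W)."""
    per_pixel_max = feature.max(dim=0).values.clamp(min=1e-6)
    return feature.sum(dim=0) / per_pixel_max

level_output = torch.rand(240, 48, 48)    # one detection level's feature map
print(activation_map(level_output).shape) # torch.Size([48, 48])
```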

VI. Conclusions

This paper proposed a system for real-time multi-person 3D pose estimation. It contains a multi-task and multi-level detection neural network that directly takes an image as input and outputs 2D joint positions, 2D joint links, and 3D joint depth information.

TABLE II
Performance Comparison

Methods                                  Input      MPJPE (mm)  Speed (ms)
Zhou et al. [24]                         Video      64.9        -
Martinez et al. [5]                      2D Pose    62.9        -
OpenPose [2] (high accuracy mode)
  + Martinez et al. [5]                  RGB Image  70.5        833.3
OpenPose [2] (high speed mode)
  + Martinez et al. [5]                  RGB Image  82.7        113.6
LCR-Net [29]                             RGB Image  87.7        -
Ours                                     RGB Image  112.9       46.5

"-" means the work is not designed for speed or cannot be measured in FPS; for example, the speed of a 2D-pose-based work depends on its pre-processing system.


Fig. 5. Some results of our work, including single-person and multi-person pose estimation results.

Fig. 6. Activation areas for different detection levels (panels: input image, high level, middle level, low level). This image shows the division of work among levels: the high level focuses on small detection targets such as necks and heads of people far from the camera; the middle level focuses on most parts of the image; the low level focuses on large parts, such as the arm of the person nearest the camera.

A post-processing stage then constructs the 3D pose for each human. With only 33 mm of accuracy loss, our RGB-image-based solution achieves 21 fps on an RTX 2080, which is affordable accuracy for games and entertainment. Our system shows that real-time 3D pose estimation for multiple people with a normal RGB camera is possible.

Acknowledgment

This work was supported by the Waseda University Grant for Special Research Projects (2019C-581).

References

[1] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR, 2017.
[2] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," arXiv preprint arXiv:1812.08008, 2018.
[3] D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt, "Single-shot multi-person 3D pose estimation from monocular RGB," in 2018 International Conference on 3D Vision (3DV), pp. 120–130, IEEE, 2018.
[4] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, "3D human pose estimation in video with temporal convolutions and semi-supervised training," in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[5] J. Martinez, R. Hossain, J. Romero, and J. J. Little, "A simple yet effective baseline for 3D human pose estimation," in ICCV, 2017.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision – ECCV 2014 (D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, eds.), Cham, pp. 740–755, Springer International Publishing, 2014.
[7] M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara, "Learning to detect and track visible and occluded body joints in a virtual world," in European Conference on Computer Vision (ECCV), 2018.


[8] T. Vatahska, M. Bennewitz, and S. Behnke, "Feature-based head pose estimation from images," in 2007 7th IEEE-RAS International Conference on Humanoid Robots, pp. 330–335, IEEE, 2007.
[9] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[10] D. Luo, S. Du, and T. Ikenaga, "End-to-end feature pyramid network for real-time multi-person pose estimation," in MVA, 2019.
[11] X. Nie, J. Feng, J. Xing, and S. Yan, "Pose partition networks for multi-person pose estimation," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 684–699, 2018.
[12] S. Cao, W. Lu, and Q. Xu, "Deep neural networks for learning graph representations," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[13] X. Chen and A. L. Yuille, "Articulated pose estimation by a graphical model with image dependent pairwise relations," in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 1736–1744, Curran Associates, Inc., 2014.
[14] W. Abdulla, "Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow," 2017.
[15] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, "RMPE: Regional multi-person pose estimation," in ICCV, 2017.
[16] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu, "Pose Flow: Efficient online pose tracking," in BMVC, 2018.
[17] A. Sharifi, A. Harati, and A. Vahedian, "Marker based human pose estimation using annealed particle swarm optimization with search space partitioning," in 2014 4th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 135–140, Oct. 2014.
[18] D. C. Blumenthal-Barby and P. Eisert, "High-resolution depth for binocular image-based modeling," Computers & Graphics, vol. 39, pp. 89–100, 2014.
[19] Z. Zhang, "Microsoft Kinect sensor and its effect," IEEE MultiMedia, vol. 19, pp. 4–10, Feb. 2012.
[20] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele, "Neural body fitting: Unifying deep learning and model based human pose and shape estimation," in 2018 International Conference on 3D Vision (3DV), pp. 484–494, IEEE, 2018.
[21] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Trans. Graphics (Proc. SIGGRAPH Asia), vol. 34, pp. 248:1–248:16, Oct. 2015.
[22] M. Drennan, "An implementation of camera calibration algorithms," Clemson University, 2010.
[23] D.-X. Zhu, "Binocular vision-SLAM using improved SIFT algorithm," in 2010 2nd International Workshop on Intelligent Systems and Applications, pp. 1–4, IEEE, 2010.
[24] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, "Sparseness meets deepness: 3D human pose estimation from monocular video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4966–4975, 2016.
[25] M. Kocabas, S. Karagoz, and E. Akbas, "Self-supervised learning of 3D human pose using multi-view geometry," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[26] C.-H. Chen, A. Tyagi, A. Agrawal, D. Drover, R. MV, S. Stojanov, and J. M. Rehg, "Unsupervised 3D pose estimation with geometric self-supervision," arXiv preprint arXiv:1904.04812, 2019.
[27] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125, 2017.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[29] G. Rogez, P. Weinzaepfel, and C. Schmid, "LCR-Net: Localization-classification-regression for human pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3433–3441, 2017.
