
Faculteit Industriële Ingenieurswetenschappen



Department of Industrial Engineering Sciences

Master in Industrial Sciences: Electronics-ICT, specialisation ICT

Creation of a Virtual Pseudo-skeleton Through 3D Imagery

Master's thesis submitted to obtain the professional title of industrial engineer. Academic year 2011-2012

By: Martijn Van Loocke

College promoter: Toon Goedemé

Company promoter: Koen Buys


Contents

Contents iii

List of Figures v

List of Tables vii

1 Contextualization and Defining the Objectives 1

1.1 Contextualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Possible uses for the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Combination of this thesis with other research . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3.1 An adaptable system for human body tracking . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Process and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Brief overview of this text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Literature Study 9

2.1 Kinect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Background removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 Graph Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.3 Grabcut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Existing methods for Skeletal Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.1 Methods to be used in this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 General Coding Methods and I/O 21

3.1 ROS, PCL & OpenCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1.1 Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1.2 OpenCV and PCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2 Structure of the code and files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.1 Saving the frames to disk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


4 Background Removal 29

4.1 OpenCV vs PCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1.1 Grabcut (OpenCV) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1.2 Thresholding along the x, y and z axes (PCL) . . . . . . . . . . . . . . . . . . . . . . . 31

4.1.3 Segmenting the surfaces in the point cloud . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.4 Segmentation based on colour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.1.5 Segmentation through connected components . . . . . . . . . . . . . . . . . . . . . . . 36

4.2 Combination with Grabcut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5 Generating the Skeleton 41

5.1 Two-dimensional skeleton algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1.1 Required assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1.2 Starting point & general strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1.3 Following the skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.1.4 Detecting the legs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.1.5 Detecting the end of a body part . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.2 Three-dimensional skeleton algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.2.1 Projecting from 2D to 3D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Testing 49

6.1 What needs to be tested . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.2 How testing was performed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.3 Issues encountered and resolved during testing . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.4 Test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

7 Future work 57

7.1 Improving upon the 3D skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

7.2 Keeping the 2D algorithm on track when tracing an arm . . . . . . . . . . . . . . . . . . . . . 57

7.3 Combining with the segmentation algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

8 Conclusion 59

8.1 Progression and steps followed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

8.2 Objectives planned and achieved . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

8.3 Closing thoughts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Bibliography 67


List of Figures

1 Planned steps (a) as a flow diagram (b) illustrated . . . . . . . . . . . . . . . . . . . . . . . . xvi

2 (a) Results of the method created by Koen Buys et al. (b) Distance transform of the background mask with the starting point indicated (c) Example of a resulting 2D skeleton (d) 3D skeleton without adjustments (e) 3D skeleton after averaging surrounding points . . . . . . . . . . . . . xvii

1.1 (a) Personal Robot 2 (PR2) (b) TurtleBot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 (a) MakeHuman mesh model (b) PR2 accepting a bottle . . . . . . . . . . . . . . . . . . . . . 3

1.3 (a) Input point cloud for segmentation algorithm (lines added post processing) [2] (b) Results from segmentation algorithm [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Planned stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Removing the background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.6 (a) Skeleton Overlay (2D Image) (b) Skeleton Overlay (3D Point Cloud) . . . . . . . . . . . . 6

2.1 Microsoft Kinect (with legend) [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Kinect IR Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 (a) Triangulation [21] (b) Stereo vision [23] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 (a) Random Graph with Nodes and Edges (b) Image with pixels represented as a graph . . . 13

2.5 (a) Minimum Cut (b) Maximum Cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.6 Energy Function (E) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.7 Separating the legs, as used in [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.8 Calculating the centroid of one slice, as used in [24] . . . . . . . . . . . . . . . . . . . . . . . 19

2.9 (a) Straight Skeleton (Rough Sketch) (b) Medial Axis Skeleton (Rough Sketch) . . . . . . . . 19

2.10 Using distance transformations to find the skeleton . . . . . . . . . . . . . . . . . . . . . . . . 20

2.11 Example of an image with feature(*) and non-feature(-) pixels along with the values assigned through a distance transform [12] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1 Point Cloud Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.1 Results from using Grabcut (with user interaction) . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 (a) Surface Segmentation (Close-up) (b) Surface Segmentation (Far) . . . . . . . . . . . . . . 34

4.3 HSV Colour Space [26] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4 (a) Original Image (b) SeededHueSegmentation + Grabcut Mask (c) Original Image + Mask . 36

4.5 Connected components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.6 (a) Untouched mask (b) Mask after Erosion and Dilation (c) Final Image (with mask applied) 39

5.1 (a) 2D skeleton (b) Test image for skeleton creation . . . . . . . . . . . . . . . . . . . . . . . 42


5.2 (a) Result of a distance transform (inverted) (b) Following a vector (blue) through a method that utilises steps (green) instead of full Cartesian coordinates (red) . . . . . . . . . . . . . . . 43

5.3 Procedure for following the 2D skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.4 (a) Cross-section before and after the branching of the legs (b) Distance transform values above the branch (c) Distance transform values below the branch . . . . . . . . . . . . . . . . . . . . 45

5.5 (a) The two stages of the legs (b) Branch in the distance transform at the top of the head (c) False end detection when using a single reference point . . . . . . . . . . . . . . . . . . . . . . 46

5.6 Detecting the end of a body part. The part of the circle outside the body is shown in percentages. 47

5.7 (a) 3D skeleton (original) (b) 3D skeleton (after averaging filter) . . . . . . . . . . . . . . . . . 48

6.1 (a) T-Pose (b) Y-Pose (c) I-Pose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.2 (a) False starting point (b) Corrected starting point . . . . . . . . . . . . . . . . . . . . . . . 52

6.3 (a) Complete skeleton (no end-detection) (b) Incomplete skeleton due to failed end-detection (points skipped in leg when stuck) (c) Incomplete skeleton due to bent arm (no end-detection) 54


List of Tables

6.1 Average success rates (a) without end-detection (b) with end-detection . . . . . . . . . . . . . 54

6.2 (a) Average time taken for the creation of the 2D and 3D skeletons (b) Average number of stuck warnings that were subsequently corrected (along with min/max values) . . . . . . . . . 54


Foreword

Before the start of this thesis text, I would like to thank a few people: my promoters, Koen Buys and Toon Goedemé, for their support; the department of Mechanical Engineering at the K.U. Leuven for providing a Kinect to work with; and the creators of OpenCV and PCL for their continued development of these libraries.


Abstract (Dutch)

The goal of this thesis is the creation of an algorithm that can extract a person from a 3D point cloud and a 2D RGB image and build a pseudo-skeleton that represents this person. First, a method is developed to remove the background, followed by the creation of a 2D skeleton. Certain assumptions are made to make this work well. The results are then used to create a 3D version of this skeleton, based on the point cloud. A Microsoft Kinect is used together with the PCL and OpenCV libraries, which are programmed against in C++. The final pseudo-skeleton is to be used later in combination with the research of Koen Buys et al. [2] to create an open source joint tracking system.

My own skeleton follows the centre line that runs through the various body parts and forms a stick figure, while the process of Koen Buys et al. creates several body regions, based on probability. These regions have very rough boundaries that are too unpredictable to be used for finding reliable reference points. When my own skeleton is drawn on top of them, the intersections of the boundaries and the skeleton can be taken as more reliable reference points. The algorithms use a few basic image and point cloud processing methods offered by OpenCV and PCL, and the methods for creating the skeletons were designed from scratch. Except for a few poses, the algorithms for the creation of the skeletons work as expected, given a few assumptions.


Abstract (English)

The goal of this thesis is the creation of an algorithm that extracts a person from a 3D point cloud and a 2D RGB image and constructs a pseudo-skeleton that is representative of the person. First, a method is developed for removing the background, followed by the creation of a 2D skeleton. Specific assumptions are made to ensure that it performs well. The results are then used to create a 3D version of this skeleton, based on the point cloud. A Microsoft Kinect is used along with the PCL and OpenCV libraries, which are utilised in C++. The resulting pseudo-skeleton is to be used in combination with the research by Koen Buys et al. [2] to create an open source joint tracking system.

My own skeleton tracks the centre line that runs through the body parts to form a stick figure, while the process by Koen Buys et al. creates various body regions, based on probability. These regions have very rough boundaries, which are too ambiguous to be used for locating key reference points. When my own skeleton is used as an overlay, the intersections between the region boundaries and my skeleton can be used as more reliable reference points. The algorithms use some basic image and point cloud processing methods provided by OpenCV and PCL, and the methods for creating the skeletons have been designed from the ground up. Apart from a few poses, the algorithms for the creation of the skeletons work as expected, given that a few assumptions are made.


Short Summary

Objective

The goal of this thesis is the creation of a two-dimensional (2D) and three-dimensional (3D) pseudo-skeleton in the form of a stick figure, using a Kinect camera. This camera can simultaneously produce a 2D image and a 3D point cloud of a scene, which makes it possible to work with both the RGB values and the depth of the pixels. These skeletons are used in combination with the results of the research by Koen Buys et al. [2]. Their system uses the information coming from the Kinect to extract a person from the background and to identify several body parts via a random decision forest. The result can be seen in figure 2a. The random nature of this method means that the boundaries between the body regions are not smooth, and as a consequence it is very difficult to pin down the midpoint of such a boundary for use as a reference point.

The skeleton produced in this thesis is to be laid on top of the body regions, and the points where the skeleton and the boundaries cross could be taken as more reliable reference points. The skeleton will run through the middle of each body part, so it can be used to find, for example, the shoulders. Methods to achieve this already exist [11, 19], but they are closed source, and the intention of this project is to make the source code open to everyone.

Figure 1 shows the planned steps that were followed to reach the objectives; they also reflect the general structure of this thesis text.

Background Removal

Although the method of Koen Buys et al. can remove the background, it was not available during the development of the code in this thesis (it was confidential and being developed in parallel). To be able to draw a skeleton, the person must first be isolated from the background. It is assumed that the background is always in motion and that only a single image is available to work with. This thesis explores several methods and finally presents a method that is a combination of the previous ones.

The chosen method first uses the depth values of the point cloud to remove from the image everything that stands behind the person. Here it is assumed that the person stands roughly in the middle of the image. Next, the floor is found by segmenting the point cloud based on groups of points that represent horizontal planes; the largest plane is assumed to be the floor. After removing the floor from the image, connected components labelling is applied to the remaining image. The group of pixels that contains the central pixel then represents the person. The result of this method is a binary mask.


Figure 1: Planned steps (a) as a flow diagram (b) illustrated

The mask has the value 0 for the background and 1 or 255 for the foreground (the person).
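The final step of the chosen method, connected components labelling, keeps only the blob of foreground pixels that contains the central pixel. A minimal sketch of that step in plain C++ (a 4-connected BFS flood fill standing in for the library routine; the seed coordinates and mask layout are assumptions, not the thesis code):

```cpp
#include <queue>
#include <vector>
#include <utility>
#include <cstdint>

// Keep only the connected foreground component that contains the seed
// pixel (assumed to be the person, roughly in the image centre).
// mask: row-major binary image, 0 = background, 255 = foreground.
std::vector<uint8_t> keepSeedComponent(const std::vector<uint8_t>& mask,
                                       int w, int h, int seedX, int seedY) {
    std::vector<uint8_t> out(mask.size(), 0);
    if (mask[seedY * w + seedX] == 0) return out;  // seed fell on background
    std::queue<std::pair<int, int>> q;
    q.push({seedX, seedY});
    out[seedY * w + seedX] = 255;
    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [x, y] = q.front();
        q.pop();
        for (int i = 0; i < 4; ++i) {
            int nx = x + dx[i], ny = y + dy[i];
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            if (mask[ny * w + nx] == 255 && out[ny * w + nx] == 0) {
                out[ny * w + nx] = 255;  // 4-connected flood fill
                q.push({nx, ny});
            }
        }
    }
    return out;
}
```

Everything not reachable from the seed, such as leftover background clutter, stays 0 in the output mask.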

Creating the Skeletons

Since the Kinect can only capture a 3D image from a single viewpoint, the 3D information is not sufficient to create a 3D skeleton directly. For this reason, it was decided to first create a 2D skeleton based on the 2D image and then add depth values to it using the 3D point cloud.

To start the 2D skeleton, a distance transform is applied to the mask from the background removal (example in figure 2b). This assigns to every foreground pixel a value representing the shortest distance to a background pixel, so the (locally) highest values lie on the line that represents the skeleton. This line is visible to a human but not to a computer. To solve this, this thesis designs an algorithm that follows the line. The starting point lies at the height of the shoulders. From that point, the arms, the neck and the torso are followed. At the bottom of the torso, the start of the legs is detected and each leg is followed separately. The result is a series of groups of locations for these 6 body parts; an example of the resulting skeleton is shown in figure 2c. A method is also implemented to detect when the end of a body part has been reached.
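The distance transform at the heart of the 2D skeleton can be sketched as a classic two-pass chamfer transform with the city-block metric (a simplified stand-in for OpenCV's cv::distanceTransform, not the thesis code):

```cpp
#include <vector>
#include <algorithm>
#include <climits>
#include <cstdint>

// City-block (L1) distance transform: every foreground pixel receives the
// distance to its nearest background pixel. Two raster-order passes.
std::vector<int> distanceTransformL1(const std::vector<uint8_t>& mask,
                                     int w, int h) {
    const int INF = INT_MAX / 4;
    std::vector<int> d(w * h);
    for (int i = 0; i < w * h; ++i) d[i] = mask[i] ? INF : 0;
    // Forward pass: propagate distances from the top-left.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int& v = d[y * w + x];
            if (x > 0) v = std::min(v, d[y * w + x - 1] + 1);
            if (y > 0) v = std::min(v, d[(y - 1) * w + x] + 1);
        }
    // Backward pass: propagate distances from the bottom-right.
    for (int y = h - 1; y >= 0; --y)
        for (int x = w - 1; x >= 0; --x) {
            int& v = d[y * w + x];
            if (x < w - 1) v = std::min(v, d[y * w + x + 1] + 1);
            if (y < h - 1) v = std::min(v, d[(y + 1) * w + x] + 1);
        }
    return d;
}
```

The tracer described above then starts near the shoulders and repeatedly steps towards the neighbouring pixel with the (locally) highest value of this map.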

The 3D skeleton is constructed from the locations of the 2D skeleton and the depth values from the point cloud. Because of the technique the Kinect uses to measure depth, the depth values at the edges of the person are quite unstable, so the resulting 3D skeleton is of very low quality (example in figure 2d). To solve this, the depth value of each point is averaged with those of a large number of the surrounding points.


Figure 2: (a) Results of the method created by Koen Buys et al. (b) Distance transform of the background mask with the starting point indicated (c) Example of a resulting 2D skeleton (d) 3D skeleton without adjustments (e) 3D skeleton after averaging surrounding points

This gives a result with slightly lower accuracy but much higher quality (the shape of the skeleton is much smoother). Figure 2e shows the skeleton after this adjustment.
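The averaging step can be sketched as follows (the window radius and the NaN handling for missing Kinect readings are assumptions, not the thesis code):

```cpp
#include <vector>
#include <cmath>

// Average the depth of a skeleton point over a (2r+1)x(2r+1) window,
// skipping invalid (NaN) readings such as the Kinect produces near edges.
float averagedDepth(const std::vector<float>& depth, int w, int h,
                    int cx, int cy, int r) {
    double sum = 0.0;
    int n = 0;
    for (int y = cy - r; y <= cy + r; ++y)
        for (int x = cx - r; x <= cx + r; ++x) {
            if (x < 0 || y < 0 || x >= w || y >= h) continue;
            float z = depth[y * w + x];
            if (std::isnan(z)) continue;  // missing measurement
            sum += z;
            ++n;
        }
    return n ? static_cast<float>(sum / n) : NAN;
}
```

A larger radius trades per-point accuracy for a smoother skeleton shape, matching the behaviour described above.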

Conclusion

After testing the algorithms, the results were found to be very good. Given the assumption that the arms and legs do not touch each other or the torso (since they would otherwise appear as a single mass, through which only one skeleton line would be drawn), the head, the torso and the legs are always detected completely. For some legs, a few points had to be skipped at the connection between torso and upper leg in order to keep following the body part, but after that the algorithm followed it to the end. The arms are always detected at least up to past the elbows. If the arm is bent upwards too much, the algorithm can stray from its expected path, but then only the part past the elbow is lost.

The detection of the end of a body part (i.e. stopping in the middle of the head, the middle of the hands, etc.) worked less well. It had a low success rate because it usually stopped too early.


If this detection is not used, the algorithm stops automatically when it reaches the edge of the mask. Since this detection is only needed for the visual presentation of the skeleton, it is not that important. The skeleton without end-detection can be used without problems for the planned purpose with the work of Koen Buys et al.

The objectives of this thesis have been achieved, and chapter 7 proposes extensions that could improve the algorithm.


Chapter 1

Contextualization and Defining the Objectives

This chapter will contextualize this thesis and define the objectives in the form of a set of specific questions. These questions will need to be answered during the course of the thesis, and the pros and cons of the various choices that arise will need to be weighed in order to strike the right balance between efficiency, workability and convenience.

The goal of this thesis has been set by the Mechanical Engineering Department at the Catholic University of Leuven. Part of the assignment is that the programming needs to be done using ROS (Robot Operating System) [17]. The usage of ROS will be covered in more depth in chapter 3.

1.1 Contextualization

The ROS software was created to support robots that are created by Willow Garage [16]. Two of these are the Willow Garage Personal Robot 2 (PR2) in figure 1.1a and the TurtleBot in figure 1.1b. The PR2 comes equipped with two fully functional arms and a camera mounted within its head, and the TurtleBot is designed to hold a Microsoft Kinect camera (the Kinect is explained in chapter 2). A Kinect can be mounted on top of the PR2, and both robots are designed to be highly mobile and able to perceive and analyse a 3D space through the Kinect. The Kinect is ideally suited for this as it weighs little and its dimensions are a mere 27.5 × 6 × 7.5 cm. It is capable of creating 3D point clouds to represent a 3D space, and with the necessary calculations both robots can use this input to navigate their environments.

One way to utilise this is human interaction. Currently, no implemented method exists in ROS to create a detailed human skeleton. The only available method is locating the main joints in the body through the openni_tracker package (based on the OpenNI NITE library) and connecting those points to form a pseudo-skeleton. The goal of this thesis is to design a method for the creation of a virtual human pseudo-skeleton based on a 3D point cloud and a 2D image with the use of the PCL (Point Cloud Library) [14] and OpenCV (Open Computer Vision) [15] libraries. The resulting skeleton needs to be made up of a sufficiently high number of points to represent the person. Given the pixelated nature of the images, the accuracy and quantity of the skeleton points depend greatly upon the resolution of the image. A typical robot camera has a VGA resolution (640 by 480); a skeleton point every 2 to 3 pixels, with a deviation from the skeletal line of the same amount, is both desirable and realistic. As the Kinect would be mounted on a robot, the


Figure 1.1: (a) Personal Robot 2 (PR2) (b) TurtleBot

background is assumed to be constantly changing.

1.2 Possible uses for the results

A virtual skeleton can be used for a variety of purposes. One such purpose would be to create an avatar of the person in a virtual world. An open source software package that could be used for this is MakeHuman [25]. Though much more detailed renders can be produced, figure 1.2a shows a crude mesh created in MakeHuman.

Another use of the skeleton would be interaction with the PR2. For example, when the user stretches out his/her arm to hand the robot an object, the orientation of the person's arm indicates to the robot that it needs to interact with the user. Figure 1.2b shows an example of this.

The skeleton data can also be used for less active tasks. My promoter at the university plans to use the data for unit tests and as a prior to a Markov chain Monte Carlo (MCMC) filter.

1.3 Combination of this thesis with other research

Near the end of the second semester, I was informed about the intended purpose of my results. It is to be used as an enhancement to a research project that was being developed in parallel with my own. Unfortunately, due to the confidential nature of this research, I was unaware of this during the development of my own code. As will be shown, however, most of my work is still applicable and will be able to complement the other research as intended.


Figure 1.2: (a) MakeHuman mesh model (b) PR2 accepting a bottle

A paper named "An adaptable system for human body tracking" is to be published in 2012 by Koen Buys (my promoter) et al. [2], which describes this research and its results. It will become a part of the PCL library. Below follows a brief description of this research and its connection to my own work.

1.3.1 An adaptable system for human body tracking

The purpose of the system is the detection of a human body in a scene and segmenting it based on body parts. The method is meant to be an open source alternative to the OpenNI NITE human tracking library included in ROS and the method developed by Shotton et al. [11] for the Kinect. Aside from the open source code, the main difference is that this new method does not require a specific initial pose (unlike the NITE library) and uses both depth and colour information (Shotton et al. only use depth) to reduce the amount of training needed. It is also designed to continue working, regardless of how cluttered the environment is. The algorithm is trained using random decision forests and about 80,000 poses that were recreated in MakeHuman. These MakeHuman meshes are enough to represent any person of either gender. Because it uses feature detection rather than algorithms based purely on the current image, background removal is not needed.

The method produces a set of body segments represented by various colours. Because of this, I will refer to this method in later sections as the segmentation algorithm. Figure 1.3b shows the results it produces. When the entire body is visible, a total of 25 body areas are differentiated: 4 on the head, the neck, 4 on the chest, the upper arms, elbows, lower arms, hands, upper legs, knees, lower legs and feet.

As can be seen in figure 1.3a, a crude skeleton is created by connecting the centres of the various body areas. Koen Buys wanted to improve the accuracy of this skeleton, and specifically the location of the joints. This is where my own thesis comes in. My skeleton will follow the shape of the body rather than connect a few points that are based on a random decision forest. Combining the two should yield better results than

Page 22: Faculteit Industriële Ingenieurswetenschappen* – Faculteit ... · Het skelet dat in deze thesis geproduceerd wordt moet gebruikt worden om boven op de lichaamsgebieden te leggen

4 CHAPTER 1. CONTEXTUALIZATION AND DEFINING THE OBJECTIVES


Figure 1.3: (a) Input point cloud for segmentation algorithm (lines added in post-processing) [2] (b) Results from segmentation algorithm [2]

Figure 1.4: Planned stages

each on its own. In theory, the halfway point along the line where two segments meet should be the location of a point on the skeleton, but the extremely jagged nature of this line makes it very difficult to specify that point. It is our intention that it will be found at the intersection of the separating line and my pseudo-skeleton.

Note that this method and research were not available to me during my thesis. I was therefore not able to use it for background removal and thus had to develop my own method for that.

1.4 Process and objectives

The process will consist of the following main stages, illustrated in figure 1.4:

• Capture and store the needed data. It may be necessary to combine the incoming data in such a way that it remains synchronised, as both the point cloud and the RGB image are needed.

• Separate the person from the background in the 2D image. The added dimension contained within a 3D point cloud could help to make this easier. Figure 1.5 shows the expected results. From left to right it shows: the original image, the 3D point cloud with a depth threshold (the yellow plane) to remove the background, the binary mask that will be placed over the original image to separate the person from the background, and the final result.


Figure 1.5: Removing the background

• Develop a method to draw a pseudo-skeleton on the 2D image of the person. A crude example of the expected end result is shown in figure 1.6a.

• Adapt the method used for the 2D skeleton for use with the 3D point cloud. A crude example of the expected end result is shown in figure 1.6b.

• Test both algorithms with a variety of subjects and poses and (if time allows) implement them in an application. This could, for example, be the creation of a virtual avatar using the skeleton as input data.

The code will be open sourced so that it can be used and improved by other ROS users. For the purposes of this thesis, a Microsoft Kinect camera will be used, but the code will be written to function with similar cameras as well (Asus Xtion, PrimeSense PSDK5.0, ...).

The objectives of this thesis are as follows:

• What is the most efficient way to segment an image in order to separate a person from his/her background? PCL and OpenCV offer many solutions, but they do not all produce the same quality of results and are not equally efficient at solving this problem.

• What degree of accuracy is needed when removing the background to facilitate the generation of the skeleton? It may be sufficient to draw a rough outline, as too much detail may interfere with the speed of the process.

• How is the skeleton best constructed? If the algorithm starts at the extremities, how are the various parts interconnected when they meet? If it starts in the middle, how are the separate body parts detected? Is additional data needed from an external program?

• Is it possible to create a true 3D skeleton with the accuracy provided by the Kinect?

• What assumptions must be made to keep the various aspects manageable?



Figure 1.6: (a) Skeleton Overlay (2D Image) (b) Skeleton Overlay (3D Point Cloud)


1.5 Brief overview of this text

This first chapter describes the context of this thesis. The stages and objectives are determined and key questions are asked that will need to be answered during the course of this study. Chapter 2 is a literature study that covers the main concepts that will need to be understood and how they apply to the issues that were stated in chapter 1. The literature study also compares existing technology and methods to those that will be developed here. Chapter 3 explains what methods will be used for file storage and management and which general programming structures will be used. Chapter 4 covers the programming code for background removal, how it was developed and the possibilities that were available. When multiple options were available, this chapter also compares and contrasts them to defend my final choice. Chapter 5 covers the development process for the skeletal generation; it covers both the 2D and the 3D versions. Chapter 6 details the testing process, its results and the conclusions drawn, followed by chapter 7, which explores the possibilities for the algorithms that I created and their place in the research by Koen Buys et al. [2]. Finally, chapter 8 presents my conclusions and compares them to the stages and objectives set previously in section 1.4.


Chapter 2

Literature Study

This literature study will cover the main concepts needed for this thesis, viewed at a conceptual level. This chapter will give an idea of what will be used and how it relates to this thesis. It will also mention related alternatives and explain why a particular one is chosen. Section 2.1 details the Kinect and compares it to similar cameras. Section 2.2 discusses some key aspects of background removal and section 2.3 compares existing methods for skeletal data generation with the objectives of this thesis.

2.1 Kinect

The Kinect is a product sold by Microsoft and created by PrimeSense [27], designed to be used with the XBox 360 games console to enable motion controls for the system. It features an RGB camera (1), a depth sensor (2), an IR projector (3) and a multi-array microphone. What makes the Kinect special is that it can provide both 2D colour data and 3D depth data simultaneously. Figure 2.1 shows the Kinect with the location of these features. At the time of writing, a Kinect can be bought for roughly 150 euros.

Drivers Though the Kinect was made to be used with an XBox 360, the manufacturer of the hardware, PrimeSense, has released the drivers for the device for use on PC. The OpenNI library [19] offers tools to write software that can interact with the Kinect.

Figure 2.1: Microsoft Kinect (with legend) [4]


Figure 2.2: Kinect IR Projection

Depth perception According to the PrimeSense website [20], the Kinect projects a field of near-IR laser light which is then captured by a regular CMOS sensor. The technology used for depth perception is called Light Coding. Viager [1] explains that the light passes through a filter and is scattered into a semi-random but constant pattern of small dots. The IR dots are deformed depending on the surface they strike, and this is deciphered by PrimeSense's System on a Chip (SoC) within the Kinect to create a depth image of the scene. PrimeSense claims that this method makes it immune to ambient light. A known problem with this technology is that strong sunlight can interfere with the results, as it contains high amounts of IR light, disrupting the projection. This means that the Kinect will not produce satisfactory results outside or when aimed at a bright light source (such as a window or a light bulb). The use of a SoC means that the CMOS sensor only has to capture the scene (unlike other technologies where, for example, the sensor has to calculate the time taken for the light to return). The projection is divided into 9 areas with a brighter dot in each centre (visible in figure 2.2). This helps to evaluate the position of the objects in the space. By taking into account the emitted light pattern, lens distortion, and the distance between emitter and receiver, the distance to each dot can be estimated by calculation. According to the study by Viager [1], the depth is given an 11-bit value, and the Reference Design fact sheet for the hardware [3] states that the resolution of the projected IR field is 1600 × 1200. The camera outputs a depth image at a resolution of 640 × 480, which is the same resolution as that of the RGB image, allowing the two to be combined as there is a depth value for each pixel. The significantly greater resolution of the projected IR field allows a margin of error that facilitates maintaining the needed precision when downscaling to 640 × 480, thus providing a unique depth value per pixel. Figure 2.2 shows the IR projection produced by a Kinect [1].


Alternatives to the hardware The Kinect was chosen for this thesis because the hardware is relatively cheap, it has all of the needed functionality and the drivers are readily available. In addition, the Kinect is natively supported by ROS. ROS also supports the PrimeSense PSDK and the ASUS Xtion Pro (which lacks RGB output), which work with the same OpenNI drivers. The Kinect was released first and is the most widely used of the three, making it more suitable for the purposes of this thesis.

There are a few alternatives to the technology used in the Kinect:

Time-of-flight (TOF) [21, 22]: These systems use light to actively probe the scene. The distance from the camera is determined by measuring the time needed for a pulse of light to strike the object and return to the sensor. A longer time indicates a greater distance. Typical time-of-flight 3D laser scanners can measure the distance of 10,000 to 100,000 points every second. The high speed of light makes calculating the round-trip time more difficult, resulting in reduced accuracy (on the order of millimetres). TOF scanners are designed to work over long distances, such as for scanning buildings, but offer only low resolutions. An example is the SwissRanger by MESA [28].

Triangulation [21]: A triangulation 3D scanner shines a laser light on the subject and the reflected light is captured by a sensor that is placed at a location different from that of the laser source. As shown in figure 2.3a, where the light strikes the sensor depends on how far away the object is. The accuracy of triangulation range finders is on the order of tens of micrometres. Their range is, however, limited to a few metres and the method is slow.

Stereo vision [23]: Stereo vision uses 2 cameras placed a short distance from one another and uses the difference between the captured images to mathematically determine the depth of an object in the scene, as illustrated in figure 2.3b. The problem that arises is that reference points need to be found in the images, which are already present in each of the other three methods. Objects that are further away also shift less between the two images, making their depth harder to determine.

The first two methods mentioned above would be more accurate and powerful alternatives to the Light Coding used in the Kinect, but the high price and the bulkiness of the hardware make them unsuitable. The Kinect is also readily available for anyone to buy in a store, while TOF and triangulation scanners are industrial devices. Stereo vision could be an alternative to the Kinect, but the requirement of 2 lenses raises the price. Finally, the robots described in chapter 1 both use either a Kinect or similar hardware. It therefore makes sense to use a Kinect camera, as it is overall more suitable for the purposes of this thesis.

2.2 Background removal

In order to find the location of the skeleton, the background must first be removed from the image. The goal is to have only the person present in the image and nothing else. This also means removing the shadow. This will need to be done with a single RGB image and its corresponding point cloud.

A suitable tool for this, which is part of the OpenCV [15] library, is Grabcut. It is a method developed at Microsoft Research Cambridge [5] and is based on the graph cut method. Graphs can be used to represent an image (see subsection 2.2.1). The connections between the pixels can then be used to determine the best path along which to make a cut that will remove an edge from the image. The intended edge is the outline of the person. Grabcut uses this principle and iterates until the cut runs along the most prominent edge. This study will only cover the general concepts used in these methods and not the mathematical procedures behind them.


OpenCV works with 2D images. It could be argued that applying a depth threshold to the 3D point cloud would be sufficient for finding the background, but preliminary experiments have shown that at greater distances (starting around 2-3 metres), the point cloud exhibits faults in the depth values. Strong ambient lighting will cause background to become part of the person, and low ambient light will cause the outer parts of the body (such as the hands) to become part of the background or be located somewhere in between the two. Grabcut bases its results on edge detection, producing much more accurate results. The problem it has is that edges that are part of the background can interfere with this, and a certain level of user interaction (which needs to be made automatic for the purposes of this thesis) is needed to attain the desired results. An added advantage of depth thresholding is that the shadow is no longer an issue, as it is merely a change in lighting and does not have volume. The hypothesis to be tested as part of this thesis is that combining Grabcut with depth thresholding should negate the disadvantages of both methods.

Other background removal techniques exist, but most require a moving image, which is not applicable to the given situation.

2.2.1 Graphs

One method (Grabcut) is a viable solution for background removal. To understand how it works, graphs must first be explained. As detailed in [6], a graph is a collection of dots that are connected. These connections can exist between any two dots, but not every possible connection necessarily exists in a given graph. A point is usually referred to as a vertex or a node. In an image, the nodes are represented by the pixels. The connections are referred to as edges. In an image, edges only exist between adjacent pixels. Figure 2.4a shows an example of a random graph and figure 2.4b shows how the pixels of an image are represented by a graph.

2.2.2 Graph Cuts

In graph theory, a cut is a path through a set of edges that begins and ends outside of the graph.

Minimum Cut A minimum cut is a cut that passes through the smallest number of edges possible while still satisfying any predefined conditions. Figure 2.5a shows an example of this. In background removal with graph cuts, the optimal solution is a minimum cut.

Maximum Cut A maximum cut is a cut that passes through the greatest number of edges possible while still satisfying any predefined conditions. Figure 2.5b shows an example of this.

Labelling and Markov Random Fields When trying to separate a foreground image from its background, each pixel in the image is labelled. This label can be viewed as a number. Pixels with the same label belong together, and a smaller number of different labels will result in a smaller number of sections that the image is divided into.

A popular method for labelling is the use of Markov Random Fields (MRFs). According to [7], MRFs are generated by comparing a pixel to its neighbours and determining the probability of them belonging to the same object. The labelling of a pixel therefore depends completely on the neighbouring pixels. This can be a problem when the background has colours similar to the foreground. In that case, some foreground pixels may be labelled as background and vice versa.



Figure 2.3: (a) Triangulation [21] (b) Stereo vision [23]


Figure 2.4: (a) Random Graph with Nodes and Edges (b) Image with pixels represented as a graph


Energy Minimization Along with a minimum cut, minimum energy is also part of the optimal solution in a graph cut. In an image, the colours of the pixels that make up an object change gradually. When the object ends and the background becomes visible, there is a sudden jump in colour difference. This difference is said to have a higher energy level than the difference between 2 pixels belonging to the same object. Cutting along the line that separates the object from its background lowers the average energy of a subsection of the image, because the high energy level that was present at the cut is no longer part of either side of the cut. This is called energy minimization and is again limited by how many labels are available per image. In this case the only labels are foreground and background.

2.2.3 Grabcut

Grabcut is a method developed at Microsoft Research Cambridge [5] and is based on graph cuts.

Improvements Grabcut has several improvements when compared to a normal graph cut.

Less user interaction: A normal graph cut requires a varying level of user interaction to ensure that the intended parts are removed and specific areas of the foreground image are not. Often a graph cut will require the user to draw a contour around the object that he wishes to keep. The Grabcut method simplifies this to a mere rectangle by taking anything outside of the rectangle as a starting point for the background.

Iterative graph cuts: Once an area of background has been specified with the rectangle, iterative graph cuts are used to find the optimal cut around the foreground object. Each iteration uses the results from the previous cut to reduce the energy of the image. The structure of the Grabcut algorithm guarantees that the function E of energy against iterations converges, as shown in figure 2.6. Once it is detected that E ceases to decrease significantly, the iterations are stopped. The iterative cuts also significantly improve results for images with similar foregrounds and backgrounds. In cases with complex backgrounds, some user interaction may be needed in addition to the initial rectangle, but in general this is kept to a minimum.

Transparency at the edges: After the iterative graph cuts, Grabcut also applies transparency to the outer edges of the resulting image to make it look smoother and thus blend better with the new background that it will be added to. This is achieved by applying alpha values to the edges in a non-linear fashion. Though intended by the people at Microsoft Research [5], this part has not been implemented in the OpenCV version.

Conclusion Theoretically, Grabcut seems like it could yield the results required for this thesis. Whether this holds true in practice remains to be seen and will be explored further later in this text.

2.3 Existing methods for Skeletal Data Generation

Several methods have been developed for skeletal tracking. Most of these detect body parts and poses using statistical data and are very efficient and effective, but they only segment the body and generate joint locations. They do not form an actual skeleton and hence are not referred to as such. Instead, the points (head, hands, elbows, chest, knees, feet, ...) are connected with straight lines to form a pseudo-skeleton. This is typically used only for illustration. A few examples of these are listed below.


Pose estimation has been developed by Grest et al. [8, 9], based on silhouette information and Iterated Closest Point respectively. This is explicitly done without physical markers of any kind. In [10], Zhu & Fujimura estimate the location of the upper body parts, such as the head and the arms, by first recognizing distinguishable features to be used for labelling major parts such as the head and hands. In a second phase, they then accurately estimate the location of the joints based on the previous data.

One of the most recent methods is that of Shotton et al. at Microsoft Research Cambridge [11]. They ran a deep randomised decision forest classifier on a database of hundreds of thousands of training images. Some of these images were of real people, but many were generated for the purpose of simulating every possible pose. The use of this many images avoids overfitting. The result is a statistical estimation of the location of various areas of the body, separated by the location of the main joints. Joint position proposals are generated through an approach based on mean shift, which reduces the effect of outlying pixels on the results. This algorithm produces frame rates of approximately 200 frames per second on XBox 360 hardware. Performance on a consumer PC would be similar.

While the above methods have been proven to work, they do not supply the data needed in this thesis. The resulting skeleton must be a continuous sequence of points that represents the entire body, not merely the joints. Therefore a new method must be designed.

2.3.1 Methods to be used in this thesis

As mentioned before, the use of Grabcut and PCL will be necessary for background removal. A few methods for skeletal generation have been discussed, but rather than creating an accurate model, they provide a rough representation based on the joint locations. Next, I will discuss a few methods that could be used for creating the desired skeleton.

Segmentation and slicing In his thesis [24], Niels Van Malderen used a method for segmenting the visual hull of a human body. He built a 3D model of the person using multiple cameras, which resulted in a solid 3D representation. This is not possible with a single Kinect, as the scene can only be viewed from one direction; moreover, my method will start with the 2D image, not the 3D point cloud, and then expand from there. Nevertheless, a few of the concepts can still be adopted for this thesis.

Van Malderen starts with the torso and assumes that it is the centre of the object. He takes a horizontal slice and finds the centroid, repeating this process by moving up and down from the centre of the torso. The collection of centre points forms the spine, and this method is used to find the rest of the skeleton. Next he moves up, detecting the transition between the torso and the head through a sudden change in width, implying that the neck has been located. He then moves down from the torso, finding the legs at the point where a sudden change in width occurs, as was the case with the head. He separates the legs by using the z-axis to determine which direction the person is facing. The width of the two legs will be greater than the depth, as shown in figure 2.7 (note the orientation of the axes). This exact method is not possible for this thesis, but if it is assumed that the person is roughly facing the camera (or facing away from it) then it should not be an issue. With all of the other body parts gone, all that is left are the arms. Van Malderen also differentiates between upper and lower arms and legs. This is because he uses a method of vertical or horizontal slices, depending on the orientation of the body part. In addition to that, he also uses a method to ensure that the presence of an arm in, for example, a horizontal slice of the torso does not offset the centroid of that slice. This is shown in figure 2.8. Zt is the original centroid and Z is the centroid of the mass of interconnected voxels within which Zt lies. It is assumed that Zt is always located within the boundaries of a mass of voxels.


Topological skeletons A topological skeleton is a line that runs through an n-dimensional shape and is equidistant from the outlines that it runs parallel to. According to [13], two of the most widely used geometric skeletons are the medial axis (MA) and the so-called straight skeleton (SS).

Straight skeletons are constructed by dividing the shape into polygons, where each polygon has one straight edge that belongs to the shape. The additional lines for the polygons are equidistant from the 2 closest edges between which they run. This method results in the skeleton being piecewise linear. Figure 2.9a demonstrates this.

Medial axis skeletons are similar to straight skeletons, but when the shape bends away from the current direction of the skeleton, a curve is calculated that lies in the middle of all nearby edges and not just the 2 that were originally used as a reference. This creates a more continuous skeleton but naturally requires more computations. Figure 2.9b shows an example of a medial axis skeleton.

Because the outline of a person does not contain any discernible corners, the strict use of either of these methods in this thesis will be difficult. The algorithms used will, however, be based on the general concept of a topological skeleton. In order to achieve this, distance transformations could be used.

Distance transformations When the background has been removed, it has been replaced with zero pixels. These are typically represented as black (having a value of 0, hence the name), but a very useful alternative is an alpha value of 255, making the background completely transparent. OpenCV [15] contains functions for distance transformations that can be used to find the nearest zero pixel. This will be crucial to building a topological skeleton. The goal is to find the nearest background pixel in 2 opposite directions. The centre of these two points would then be a point on the skeleton. This is illustrated in figure 2.10. The concept of distance transformations is a more general idea that applies to digital images. In [12], Borgefors explains that a distance transformation converts a binary image, consisting of feature and non-feature pixels, into an image where all non-feature pixels have a value corresponding to the distance to the nearest feature pixel. Normally this computation would cover the entire image, but this is very resource heavy. Therefore, a more localised search field of a few pixels in each direction is used.

A general example of this is shown in figure 2.11, where an asterisk (*) represents a feature pixel and a minus sign (-) represents a non-feature pixel. Every non-feature pixel receives a value that shows the shortest distance to the shape of asterisks in the middle.

The OpenCV functions implement this concept, but with the roles reversed: the feature pixels are the zero pixels and the non-feature pixels are the non-zero pixels.

2.3.2 Conclusion

The segmentation concepts used by Van Malderen could be useful for finding reference points from which to start creating the skeleton. This seems to be a possible replacement for the techniques mentioned at the beginning of this section ([8, 11, 10]) for locating the main joints. The slicing technique, however, needs to be adjusted, as it only uses horizontal and vertical slices. It shares with topological skeletons the principle that the skeleton is a line that runs through the centre of the object, equidistant from the sides. Distance transformations can help to find the orientation of the body part and the equidistant line, as shown in figure 2.10.

My intention is to find the general locations of either the joints or certain body areas and use those reference points to start drawing a skeleton. A straight line that connects a set of two points (e.g. elbow to shoulder) could provide a general direction, and distance transformations will then be used to specify a more accurate location for each point in the skeleton.


2.4 Summary

This thesis will use a Kinect to capture a 2D RGB image and a 3D point cloud of a scene. The Kinect uses an IR laser and Light Coding to determine depth values for the scene and outputs both images at a resolution of 640 × 480. The background will then be removed, possibly using a combination of Grabcut and depth thresholding as provided by the OpenCV and PCL libraries respectively. The combination of the two methods should negate the disadvantages of both.

Grabcut is based on the graph cut method, which interprets an image as a collection of nodes connected by edges and cuts along the edges that have the highest energy. Grabcut iterates this procedure to produce better results and reduces the required user interaction by initialising with a region of interest that specifies which parts are definitely background.

Existing methods locate the main joints of the body but do not create an actual skeleton. Instead, they connect those points to form a pseudo-skeleton. The purpose of this thesis is to create a closer estimation of the real skeleton by using distance transformations to create something similar to a topological skeleton.



Figure 2.5: (a) Minimum Cut (b) Maximum Cut

Figure 2.6: Energy Function (E)


Figure 2.7: Separating the legs, as used in [24]

Figure 2.8: Calculating the centroid of one slice, as used in [24]


Figure 2.9: (a) Straight Skeleton (Rough Sketch) (b) Medial Axis Skeleton (Rough Sketch)


Figure 2.10: Using distance transformations to find the skeleton

Figure 2.11: Example of an image with feature (*) and non-feature (-) pixels along with the values assigned through a distance transform [12]


Chapter 3

General Coding Methods and I/O

This chapter will cover the practical aspects that will be used for writing the programming code. In section 3.1, the ROS platform, PCL and OpenCV will be introduced. Section 3.2 explains the structure of the classes and the files, and the I/O methods used when data needs to be saved for later use.

3.1 ROS, PCL & OpenCV

This thesis will be built using the Robot Operating System (ROS) [17]. It is a terminal-based programming environment (run from within Linux), created by Willow Garage [16] and designed to provide open source libraries and tools for building robotics applications. It is a platform for code sharing, a middleware layer and a development framework. It also provides certain parts of the PCL and OpenCV libraries within separate packages, enabling the programmer to create links between the two.

3.1.1 Topics

Along with a system for compiling code that automates certain steps for the programmer, ROS also facilitates interprocess communication (an example of middleware) by introducing topics. Topics are data streams that are published on a network (locally or over a physical network) by a single process called a ROS node. Other ROS nodes (including the publisher) can subscribe to the topic and receive the data from the publisher. This hides the network protocols from the programmer, making it much easier to transmit and receive data between programs. One computer runs the roscore tool in the background and, if each computer on the network has that specific roscore set as its master, they will all receive the same topics. The connections between the nodes are peer-to-peer and roscore provides arbitration. ROS provides tools to survey the network for the topics being transmitted through the commands rxgraph (GUI) and rostopic (command line).
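The publish/subscribe idea behind topics can be illustrated with a single-process sketch in plain C++. This is not the actual ROS machinery, which is typed, networked and asynchronous; TopicBus and its methods are hypothetical names standing in for roscore's registry role.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Minimal single-process sketch of the publish/subscribe pattern that ROS
// topics provide across processes. TopicBus plays the role of roscore:
// it keeps a registry of subscribers per topic name and forwards every
// published message to all of them.
class TopicBus {
public:
    using Callback = std::function<void(const std::string&)>;

    // Register a callback for a named topic.
    void subscribe(const std::string& topic, Callback cb) {
        subscribers_[topic].push_back(std::move(cb));
    }

    // Deliver a message to every subscriber of the topic.
    void publish(const std::string& topic, const std::string& msg) const {
        auto it = subscribers_.find(topic);
        if (it == subscribers_.end()) return; // nobody is listening
        for (const auto& cb : it->second) cb(msg);
    }

private:
    std::map<std::string, std::vector<Callback>> subscribers_;
};
```

A node that both publishes and subscribes to the same topic would simply appear in the registry twice, mirroring the ROS behaviour described above.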

The basic template of a ROS subscriber node contains the following code (the commentary is based on a sample code from ros.org):

#include <ros/ros.h>

/**
 * The ros::init() function needs to see argc and argv so that it can perform
 * any ROS arguments and name remapping that were provided at the command line.
 * For programmatic remappings you can use a different version of init() which takes
 * remappings directly, but for most command-line programs, passing argc and argv
 * is the easiest way to do it. The third argument to init() is the name of the node.
 *
 * You must call one of the versions of ros::init() before using any other
 * part of the ROS system.
 */
ros::init(argc, argv, "nameOfNode");

/**
 * NodeHandle is the main access point to communications with the ROS system.
 * The first NodeHandle constructed will fully initialize this node, and the last
 * NodeHandle destructed will close down the node.
 */
ros::NodeHandle n;

/**
 * The subscribe() call is how you tell ROS that you want to receive messages
 * on a given topic. This invokes a call to the ROS master node, which keeps a registry
 * of who is publishing and who is subscribing. Messages are passed to a callback
 * function, here called chatterCallback. subscribe() returns a Subscriber object that
 * you must hold on to until you want to unsubscribe. When all copies of the Subscriber
 * object go out of scope, this callback will automatically be unsubscribed from this
 * topic.
 *
 * The first argument to the subscribe() function is the name of the topic, the second
 * argument is the size of the message queue and the third argument is the name of the
 * callback function that will be called when a topic is detected. If messages are
 * arriving faster than they are being processed, this is the number of messages that
 * will be buffered up before beginning to throw away the oldest ones. If the
 * NodeHandle is part of a class, a fourth parameter is needed to point to the class
 * using "this".
 */
ros::Subscriber sub;

/* use if not part of a class */
sub = n.subscribe("camera/rgb/image_color", 1, &chatterCallback);

/* use if part of a class */
sub = n.subscribe("camera/rgb/image_color", 1, &className::chatterCallback, this);

/**
 * ros::spin() will enter a loop, pumping callbacks. With this version, all
 * callbacks will be called from within this thread (the main one). ros::spin()
 * will exit when Ctrl-C is pressed, or the node is shut down by the master.
 * ros::spinOnce() can be used if you want to manually control when ros spins. This
 * way you can still call other functions in a loop before ros::spinOnce is called again.
 */
ros::spin();

All data coming from the Kinect, such as the point cloud and the RGB image, will be published by the openni_launch node, which is part of the openni_camera package included in the ROS installation. This package is a wrapper for the driver for OpenNI depth (+RGB) cameras. According to the ROS website [18], ROS Electric currently supports the following models:

• Microsoft Kinect

• PrimeSense PSDK

• ASUS Xtion Pro (no RGB)

The 3 topics that are primarily interesting for this thesis are:

• camera/rgb/camera_info (camera calibration and metadata needed to project 3D to 2D)

• camera/rgb/image_color (raw image from the device; the format is Bayer GRBG for the Kinect)

• camera/rgb/points (point cloud along with the RGB image used to colour each point)



The code will be based on data coming from the Kinect but should work with any device that supports point clouds and RGB images and has a driver that publishes the needed ROS topics, with the exception of some minor aspects such as colour formats.

Synchronizing topics It is possible to synchronize the three topics to ensure that they were captured at the same moment in time. Partway through the development process I discovered a way to use only the point cloud topic, but synchronization could still be of interest for future uses. The OpenNI drivers publish a topic with a camera_info object that contains metadata concerning the camera and how the point cloud and the image relate to each other. This data can be used in conjunction with the point cloud topic to create a projection of the 3D point cloud to a 2D image. To achieve this, the topics must have the same Header ID in their respective headers. This requires several steps.

Firstly, the callback function needs to be adjusted so that it can receive more than one topic at a time. As seen in 3.1.1, the subscribe member function can only accept one topic to listen to. The message_filters::Synchronizer class can subscribe to multiple topics and will direct one message of each to the same invocation of the callback function.

ros::NodeHandle n;

message_filters::Subscriber<sensor_msgs::PointCloud2> point_cloud_sub;
message_filters::Subscriber<sensor_msgs::CameraInfo> camera_info_sub;
message_filters::Subscriber<sensor_msgs::Image> rgb_image_sub;

point_cloud_sub.subscribe(n, "camera/rgb/points", 1);
camera_info_sub.subscribe(n, "camera/rgb/camera_info", 1);
rgb_image_sub.subscribe(n, "camera/rgb/image_color", 1);

typedef message_filters::sync_policies::ApproximateTime<sensor_msgs::CameraInfo,
    sensor_msgs::PointCloud2, sensor_msgs::Image> DataSyncPolicy;

message_filters::Synchronizer<DataSyncPolicy> sync(DataSyncPolicy(2),
    camera_info_sub, point_cloud_sub, rgb_image_sub);

sync.registerCallback(boost::bind(&DataLoader::dataCallback, this, _1, _2, _3));

Each ros::Subscriber is replaced by a message_filters::Subscriber and the subscribe member function is called. DataSyncPolicy is a typedef describing a synchronization policy specific to the classes contained within these topics. The policy is called ApproximateTime because not all topic objects are sent at exactly the same time; objects that were sent at roughly the same time are delivered to the same callback invocation. The message_filters::Synchronizer type uses the DataSyncPolicy along with the subscribers. DataSyncPolicy takes a value as its argument to specify the size of the buffer. As the time between topics is never exact, a buffer length greater than 1 is needed to compensate for this. I have noticed that when the buffer is limited to a length of 1, no callbacks ever occur. A longer buffer will increase the rate of callbacks, but it also means that the buffer will prevent the images in the callback from running in real time if they are not read fast enough.

Finally, the registerCallback member function is called with the name of the callback function as its argument, contained within a boost::bind function. The boost::bind function is taken from the Boost library [29] and combines a function with its arguments to make it ready for use. The reference to the class itself through this is only required when the callback function is part of a class.

In the callback function an additional set of steps is required to find messages with the same Header ID and transform them so that they are usable in conjunction with each other:

void DataLoader::dataCallback(const sensor_msgs::CameraInfoConstPtr& camera_info,
                              const sensor_msgs::PointCloud2ConstPtr& point_cloud,
                              const sensor_msgs::ImageConstPtr& rgb_image)
{
  /// Find messages with matching frame ID
  tf::TransformListener tf_;
  bool found_transform = tf_.waitForTransform(oriFrame.camInfo.header.frame_id,
      oriFrame.cloud.header.frame_id, ros::Time::now(), ros::Duration(1.0));
  ROS_ASSERT_MSG(found_transform, "Could not transform to camera frame");

  /// Transform point_cloud to be matched with camera_info
  tf::StampedTransform transform;
  sensor_msgs::PointCloud2 cloudOut;
  tf_.lookupTransform(oriFrame.camInfo.header.frame_id,
      oriFrame.cloud.header.frame_id, oriFrame.camInfo.header.stamp, transform);
  pcl_ros::transformPointCloud(camera_info->header.frame_id, transform, *point_cloud, cloudOut);

  /// Initialize camera model
  image_geometry::PinholeCameraModel camModel;
  camModel.fromCameraInfo(*camera_info);
}

The resulting PinholeCameraModel and transformed point cloud can then be used together to project the 3D cloud points to 2D pixels:

pcl::PointCloud<pcl::PointXYZRGB> cloud; // Filled elsewhere

/// Extract one point from the point cloud
pcl::PointXYZRGB& pt = cloud.points[index];

/// Reproject the point cloud onto the image
cv::Point3d cv_pt(pt.x, pt.y, pt.z);
cv::Point2d uv;
uv = cam_model.project3dToPixel(cv_pt);

The cv::Point2d object now has x and y attributes that specify the projected 2D coordinates of the pixel. It would, of course, be impossible to convert the pixels back to points in the cloud. It is therefore essential not to dispose of the point cloud.
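At its core, the projection performed here is the standard pinhole model: u = fx·x/z + cx and v = fy·y/z + cy, where fx and fy are the focal lengths in pixels and (cx, cy) is the principal point taken from the camera_info intrinsics. The sketch below reproduces that arithmetic in plain C++; the Intrinsics struct and projectToPixel function are illustrative stand-ins, not the image_geometry API.

```cpp
#include <utility>

// Camera intrinsics as found in a camera_info message: focal lengths in
// pixels (fx, fy) and the principal point (cx, cy).
struct Intrinsics { double fx, fy, cx, cy; };

// Pinhole projection of a 3D camera-frame point (x, y, z) to 2D pixel
// coordinates (u, v). Points with z <= 0 lie at or behind the camera
// plane and cannot be meaningfully projected.
std::pair<double, double> projectToPixel(const Intrinsics& K,
                                         double x, double y, double z) {
    return { K.fx * x / z + K.cx,   // u: horizontal pixel coordinate
             K.fy * y / z + K.cy }; // v: vertical pixel coordinate
}
```

The division by z is exactly why the mapping cannot be inverted from the 2D pixel alone: all points on the same ray through the optical centre project to the same pixel, which is the reason the original point cloud must be kept.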

3.1.2 OpenCV and PCL

Willow Garage also supports two libraries that will be vital for this thesis: OpenCV and PCL.

OpenCV OpenCV is a library designed and optimized for processing and displaying 2D images. The following code is the basis for displaying an image using OpenCV:

const char WINDOW[] = "Kinect Display - Esc to exit";

/// Image data, created elsewhere
sensor_msgs::ImageConstPtr& msg;

/// Create a new window
cv::namedWindow(WINDOW, CV_WINDOW_AUTOSIZE);

/// Pointer to the OpenCV image type
cv_bridge::CvImagePtr cv_ptr;

/// Convert the external image
cv_ptr = cv_bridge::toCvCopy(msg, sensor_msgs::image_encodings::BGR8);

/// Update the image shown in the window
cv::imshow(WINDOW, cv_ptr->image);

/// Wait for a keyboard input for 1 ms, then continue
checkKey(cv::waitKey(1));

Note that the example above is used when receiving the sensor_msgs::ImageConstPtr object type directly from the Kinect. cv::imshow is capable of displaying a variety of classes native to OpenCV such as the cv::Mat class. cv::Mat is a simple matrix of given dimensions and layers (e.g. gray-scale is one layer and BGR is three layers). When images are constructed from scratch, the cv::Mat class is much more suitable and will be displayed with the same cv::imshow function.

PCL PCL (Point Cloud Library) is a library created to display and manipulate point clouds. A point cloud is an array of data sets that describe a 3D space. Each element of the array contains the Cartesian (X, Y, Z) coordinates of that point in space and, if available, the RGB data or other multi-dimensional channels for that point. The PCL library provides a class to visualize point clouds. It allows the user to move the camera around in a 3D space and adjust the scale. It also supports keyboard and mouse callbacks for additional interactivity. A basic visualizer can be created with the following code:

pcl::visualization::PCLVisualizer* viewer =
    new pcl::visualization::PCLVisualizer("3D Viewer");
viewer->registerKeyboardCallback(&kinectCloud::keyboardEventOccurred, *this, 0);
viewer->setBackgroundColor(0, 0, 0);
// The argument is the size of the arrows at the origin
viewer->addCoordinateSystem(0.3);
// This point cloud is filled elsewhere. It contains the cloud and the rgb image
pcl::PointCloud<pcl::PointXYZRGB> cloud;
// This ColorHandler contains the rgb part of the point cloud
pcl::visualization::PointCloudColorHandlerRGBField<pcl::PointXYZRGB> handler(cloud);
// Optionally, this ColorHandler can be used instead of handler
// to make the cloud monochrome
pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZ>
    single_color(cloud, 0, 153, 255);
// Remove the previous point cloud from the visualizer
viewer->removePointCloud(CLOUDNAME);
// Add the new point cloud to the visualizer
viewer->addPointCloud(cloud, handler, CLOUDNAME);

viewer->spinOnce(500); // Run the visualizer for 500 ms

if (viewer->wasStopped()) // Check if the close button was clicked
    exit(0);

Figure 3.1 shows the resulting viewer and an example of a point cloud that corresponds to the 2D image in figure 1.5.
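The array layout described above, where each element couples Cartesian coordinates with colour, can be sketched with simplified stand-ins for the PCL types (the struct names below are illustrative, not the PCL definitions). Storing the points row-major is what makes per-pixel lookup on an organized, dense cloud possible.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for one element of a point cloud: Cartesian
// coordinates plus RGB colour data.
struct PointXYZRGB {
    float x, y, z;
    std::uint8_t r, g, b;
};

// Simplified stand-in for an organized point cloud: width * height points
// stored row-major, so the point for pixel (col, row) lives at index
// row * width + col. This is the layout a dense cloud relies on.
struct Cloud {
    std::size_t width = 0, height = 0;
    std::vector<PointXYZRGB> points; // size == width * height when dense

    // 2D lookup, valid only while the cloud stays dense.
    const PointXYZRGB& at(std::size_t col, std::size_t row) const {
        return points[row * width + col];
    }
};
```

This also previews the density requirement discussed in section 3.2.1: as soon as points are erased, the index arithmetic in at() no longer corresponds to pixel positions.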

In this thesis, I will use OpenCV and PCL alongside each other, as the functionality they each provide is needed to generate the desired results. As they are part of the ROS structure, interaction between the two libraries has been accommodated for, and functions exist to transform data between the two formats.

3.2 Structure of the code and �les

The code created in this thesis should be usable in other projects. Because not every project will require every aspect of this thesis, I will group parts of the functionality. The object-oriented nature of C++ facilitates this perfectly with the use of classes. Each class will have a .cpp code file and a .h header file. Every class will be designed to work as a separate unit and will receive and send inputs and outputs to other classes.

An example of this is a group of four classes: a data class, an image viewer, a point cloud viewer and a controller class. The data class receives data through ROS topics, processes it and stores the final result. The controller then requests the data from the data class and sends only the necessary data to each respective viewer. This way the viewers rely only on the data they receive and not on how it is generated. If the data is later generated through a different method (or another class altogether), neither the controller nor the viewers will require any changes to their code to keep functioning. As suggested in the example above, I will use a controller class specific to this thesis to direct all data and actions. This corresponds with the Model-View-Controller (MVC) design methodology [30].
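A minimal sketch of this decoupling could look as follows; the class names and members are illustrative, not the actual thesis classes. The controller is the only class that knows all the others, and each viewer receives only the slice of the data it needs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Model: holds the processed result of one frame.
struct FrameData {
    std::vector<float> depth; // e.g. per-point depth values
    std::string rgbInfo;      // e.g. a description of the RGB image
};

// View: consumes only 2D data; it has no idea where the data came from.
class ImageViewer {
public:
    void show(const std::string& rgbInfo) { lastShown_ = rgbInfo; }
    const std::string& lastShown() const { return lastShown_; }
private:
    std::string lastShown_;
};

// View: consumes only 3D data.
class CloudViewer {
public:
    void show(const std::vector<float>& depth) { pointsShown_ = depth.size(); }
    std::size_t pointsShown() const { return pointsShown_; }
private:
    std::size_t pointsShown_ = 0;
};

// Controller: the only class that knows both the model and the views.
class Controller {
public:
    Controller(ImageViewer& iv, CloudViewer& cv) : iv_(iv), cv_(cv) {}
    void update(const FrameData& d) {
        iv_.show(d.rgbInfo); // each viewer gets only its part of the data
        cv_.show(d.depth);
    }
private:
    ImageViewer& iv_;
    CloudViewer& cv_;
};
```

Swapping the data source (live Kinect topics versus a pre-recorded file) then only touches the model side; the viewers and the controller interface stay unchanged.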



Figure 3.1: Point Cloud Viewer

3.2.1 Saving the frames to disk

It is not always possible to work in a space with the right conditions for the Kinect. For this reason I decided to create a method for saving all relevant information that belongs to a frame. The goal is to be able to perform every aspect of this thesis using a single, pre-recorded file without the need for the Kinect camera to be connected.

At first I decided to save the camera_info, point cloud and rgb image messages, as they are all that is needed to display both the point cloud and the RGB image and to retain the ability to project 3D coordinates to 2D. Near the end of the semester I developed methods that required only the original point cloud. As mentioned before, I will cover each significant option that was available.

Storing the camera_info, point cloud and rgb image messages The data that needs to be saved to a single file is that of the camera_info, point cloud and rgb image messages. Storing the original messages will allow alternative methods of processing to be used. First I tried to do this by writing the raw object data to a binary file. I soon discovered that this produced unexpected results when the classes were not serialized. ROS has built-in functions for serialization, but the rosbag API provides a much more straightforward and simple method. Rosbag is a tool that is part of the ROS package for storing incoming topics. It is an external tool (not part of the code) that simulates the advertising of topics that have been recorded previously. With the large amount of data being sent by the Kinect, it would be impractical to use this tool directly for this problem. Rosbag also provides an API that can be integrated directly into the C++ code. It allows single messages to be stored in a common .bag file. This is much more accurate and significantly reduces the disk space required to store a single frame. Saving messages with the rosbag API can be done as follows:

/// These messages are initialized elsewhere
sensor_msgs::PointCloud2 cloud;
sensor_msgs::CameraInfo camInfo;
sensor_msgs::Image rgbImage;

string fileName = "data.bag";
rosbag::Bag bag;
bag.open(fileName, rosbag::bagmode::Write);
bag.write("cloud", cloud.header.stamp, cloud);
bag.write("camInfo", camInfo.header.stamp, camInfo);
bag.write("rgbImage", rgbImage.header.stamp, rgbImage);
bag.close();

The write member function takes three arguments: the name of the topic (this can be anything and will be used to retrieve the topic from the bag later), a time stamp (I have chosen the time stamp of the stored message) and the object to be stored. The file name typically has a .bag extension. At a later time, the bag can be opened and the messages can be recovered with the following code:

string fileName = "data.bag";
rosbag::Bag bag;
bag.open(fileName, rosbag::bagmode::Read);

std::vector<std::string> topics;
topics.push_back(std::string("cloud"));
topics.push_back(std::string("camInfo"));
topics.push_back(std::string("rgbImage"));

rosbag::View view(bag, rosbag::TopicQuery(topics));

BOOST_FOREACH(rosbag::MessageInstance const m, view) {
  sensor_msgs::PointCloud2::ConstPtr ext_cloud = m.instantiate<sensor_msgs::PointCloud2>();
  sensor_msgs::CameraInfo::ConstPtr ext_cam_info = m.instantiate<sensor_msgs::CameraInfo>();
  sensor_msgs::Image::ConstPtr ext_rgb_image = m.instantiate<sensor_msgs::Image>();
}

The bag file is opened in read mode, a vector of topics is created and a rosbag::View object is created for the bag file. The loop then cycles through each topic and a check is done to see if a message type is matched. It is important to note that the instantiate member function returns a pointer. If this code is run within a separate function, the data will need to be copied if it is to be retained.

Storing a single point cloud When a single point cloud needs to be saved to disk, a PCL function can be used that is specifically designed for this purpose. The functions for saving and loading a point cloud respectively are:

string fileName = "Name_of_file.pcd";
pcl::PointCloud<pcl::PointXYZRGB> pclCloud;

pcl::io::savePCDFile<pcl::PointXYZRGB>(fileName, pclCloud);
pcl::io::loadPCDFile<pcl::PointXYZRGB>(fileName, pclCloud);

This saves the point cloud in PCL's native .pcd (point cloud data) format. This has several advantages over using the rosbag method. It is much simpler to implement, as it only requires one function for each operation, and the format allows for data compression. Using .pcd files also allows the saved point clouds to be used by other PCL programs because it is the established standard. Point clouds from other projects can also be used as long as they are compatible with this project. Two conditions need to be satisfied for this:

• The cloud needs to be dense. This means that the number of points is equal to the original resolution. Some filters remove points from the cloud. This results in the .at(x,y) member function of the point cloud no longer working because the location of the points is no longer continuous. The only way to determine the 2D coordinates of a point is then to combine it with the original corresponding camera_info object, which will not be available. Useful functions such as a passthrough filter (which removes points outside a given region of interest) are therefore off limits because the resulting cloud is no longer dense.

• The point cloud must have points of the type pcl::PointXYZRGB. These contain both the Cartesian coordinates of the point and the RGB data.

It is clear that storing a single .pcd file is the better choice, though additional care needs to be taken to keep the cloud dense. Some functionality that is automated in PCL will need to be done manually to achieve this.
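One way to filter a cloud while satisfying the density condition is to overwrite rejected points with NaN coordinates instead of erasing them, which is the convention PCL itself uses for invalid points in organized clouds. A plain C++ sketch under that assumption (the Pt type and function name are illustrative stand-ins, not PCL API):

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Illustrative stand-in for a cloud point.
struct Pt { float x, y, z; };

// Range-filter a cloud without destroying its density: points outside
// [zNear, zFar] are overwritten with NaN coordinates instead of being
// erased, so the index <-> pixel correspondence of a dense cloud survives.
void passThroughKeepDense(std::vector<Pt>& points, float zNear, float zFar) {
    const float nan = std::numeric_limits<float>::quiet_NaN();
    for (Pt& p : points)
        if (!(p.z >= zNear && p.z <= zFar))
            p = Pt{nan, nan, nan}; // invalidate, do not remove
}
```

Downstream code must then skip NaN points explicitly, but it can still map every remaining point back to its original 2D position.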

3.3 Summary

The ROS platform will be used to create the code in this thesis. The topics functionality in ROS will be used to communicate data from the Kinect driver to the software written in this thesis. In addition, the PCL and OpenCV libraries will be needed to process the data and produce the desired results. PCL and OpenCV each have their own viewer to visualize 3D point clouds and 2D images respectively.

A file structure will be used that corresponds with the modular structure of the code, with a code file (.cpp) and header (.h) for each class. The whole program will be written with a Model-View-Controller design, keeping data separate from visualization. In order to enable the program to be used without a Kinect being connected, point clouds will be saved to .pcd files. These are native to PCL and will allow the entire data structure to be reconstructed as if it came from a Kinect.


Chapter 4

Background Removal

This chapter will cover the removal of the background from an image. The goal is to have a filter that leaves only the person and nothing else. If other objects are still present in the scene, it will be significantly more difficult to estimate the position of the skeleton. The most important priority in this chapter is thus to make sure that only the person remains. In the course of finding a suitable method, I tried several that were inadequate. Section 4.1 covers several of these methods to demonstrate why they were unsuitable. The end of that section will explain why my chosen method is superior to the rest and will assess whether it produces the desired result. Section 4.2 explores the combination of the method found in section 4.1 with Grabcut to see if the results show enough improvement to justify the added processing time.

4.1 OpenCV vs PCL

The first step in generating a skeleton is segmenting the image to separate the background from the foreground. Two possible approaches are available for this: manipulating the RGB image directly in OpenCV (2D) or filtering the point cloud in PCL (3D). Eventually a mask will need to be created in 2D to apply to the image, but the question here is which method to use. This section will describe the various methods that I tried and cover their advantages and disadvantages. In the end, a satisfactory method will be chosen.

4.1.1 Grabcut (OpenCV)

Grabcut is an algorithm for segmenting an image along lines that reduce the energy function as significantly as possible (see the literary study in chapter 2 for a more detailed description). In this thesis, it will be used to automatically separate a person from his/her background. The most basic version of Grabcut requires the user to specify a rectangle that is drawn around the person. Anything outside the rectangle will be regarded as definitely background and the background within the rectangle will be estimated based on features from the area outside of it. If the background is simple, without pronounced edges or colours similar to those that make up the person, this should suffice and will deliver satisfactory results. These sorts of backgrounds do not, however, occur much outside of photo shoots, and thus additional user input is needed.

Additional pixels can be specified as being background or foreground. These can be marked as definite or probable. The difference is that the Grabcut algorithm will assume that definite inputs do not need to be changed, but that probable inputs can be rejected if doing so improves the results. The header file for Grabcut defines these states as:

• GC_BGD - definitely background (= 0)

• GC_FGD - definitely foreground (= 1)

• GC_PR_BGD - probably background (= 2)

• GC_PR_FGD - probably foreground (= 3)

A mask in the cv::Mat format with the same size as the image is used for this. Each element in the mask corresponds to a pixel in the image and defines its state.

The function prototype for Grabcut in OpenCV is:

void grabCut(const Mat& image, Mat& mask, Rect rect, Mat& bgdModel, Mat& fgdModel, int iterCount, int mode);

image is the image that Grabcut will be applied to and mask is the matrix that specifies for each pixel whether it belongs to the background or foreground. It is important to note that the Grabcut function does not change the image in any way, nor does it return anything. All it does is apply its algorithm to the image and change the mask accordingly. The mask must then be applied to the image at a later stage. rect is the rectangle that can be used to specify a region of interest (ROI) for initializing the mask. It is only used once, and only if the selected mode allows it, but it must be present as a parameter regardless of the mode (even if it is a rectangle of 0 by 0). bgdModel and fgdModel are mandatory matrices that the user need only create and leave as is during the iterative use of Grabcut. iterCount is the number of iterations that must be performed. The Grabcut algorithm works in such a way that a higher number of iterations produces better results but naturally requires more computation time. A compromise must therefore be made between results and performance. mode sets the operation mode for Grabcut. The possible values are:

• GC_INIT_WITH_RECT - Initializes the state and the mask using the provided rectangle. All pixels outside the ROI are automatically initialized with GC_BGD, while those inside are treated as probable foreground. After that it runs iterCount iterations of the algorithm.

• GC_INIT_WITH_MASK - Initializes the state using the provided mask. This can be combined with GC_INIT_WITH_RECT so that all the pixels outside of the region of interest, specified by rect, are automatically initialized with GC_BGD. After that it runs iterCount iterations of the algorithm.

• GC_EVAL - The algorithm should just resume for iterCount iterations. This is used once GC_INIT_WITH_RECT or GC_INIT_WITH_MASK has already been applied.

• If no mode is supplied, GC_EVAL is the default value.

As it is impossible to create a mask out of thin air, using Grabcut by itself means that it must be initialized with a rectangle. It must therefore be assumed that the person will be within a certain region of interest. In theory this could be enough but, as mentioned before, the background must be simple enough to facilitate using Grabcut without additional user interaction. The preparation of a more specific mask must first be done using other methods.



Figure 4.1: Results from using Grabcut (with user interaction)

Results The example in figure 4.1 was made with a sample program that demonstrates Grabcut when it is used by itself. The green rectangle is used to initialize the cut. The stones are too similar to the tiger and by extension some grass is also included. The blue and red lines are user input to indicate background and foreground respectively. The results are better but still not perfect. In combination with the accuracy provided by the depth in PCL, the results could be much better. If a red line had been drawn on the tail, it would have been included too.

4.1.2 Thresholding along the x, y and z axes (PCL)

A point cloud can be thresholded along the x, y and z axes in two ways using PCL. The first is using the pcl::PassThrough function, which is referred to as a filter. This will permanently remove points from a cloud, leaving only those within the specified margins. The cloud is then no longer dense, which is not desirable (see section 3.2.1). The function is implemented using the following code:

pcl::PointCloud<pcl::PointXYZRGB> cloud;    // the point cloud, created elsewhere
pcl::PointCloud<pcl::PointXYZRGB> cloudOut; // the point cloud to which the result will be saved
pcl::PassThrough<pcl::PointXYZRGB> pass;    // create a new pcl::PassThrough object
// set the point cloud as the input of the filter (setInputCloud expects a shared pointer)
pass.setInputCloud(cloud.makeShared());

/**
 * The following 3 member functions should be repeated
 * for each of the axes that need to be adjusted.
 */
// The axis along which points will be added or removed. Possible inputs here are
// "x", "y" and "z" for the respective Cartesian axes.
pass.setFilterFieldName("x");
// The arguments are two doubles that define the boundaries, measured from the origin
pass.setFilterLimits(trimXLeft, trimXRight);
// This function applies the filter to the input cloud and stores it in the given cloud.
// The original cloud is only changed if the output cloud is the same as the input cloud.
pass.filter(cloudOut);

The second method is to inspect the cloud point by point, comparing the coordinates of each with the set boundaries. This requires more code but gives the programmer more freedom. The main advantage is that the cloud remains dense. Another advantage of inspecting the points separately is the reduction of redundant work: if the cloud needs to be converted to a 2D image, each point will have to be visited separately anyway, so the thresholding can be done at the same time, making the processing time of the PassThrough filter excessive. The following is an example of cycling through the points in a cloud:

pcl::PointCloud<pcl::PointXYZRGB> cloud; // the point cloud, created elsewhere
BOOST_FOREACH (const pcl::PointXYZRGB& pt, cloud.points) {
    if (pt.z > near_limit && pt.z < far_limit) {
        // Do something when the point is located within the range on the z-axis
    }
}

It is important to note that this method will also work on the output of a PassThrough filter because it does not assume where each point is located in the cloud.points array. The x-y coordinates, however, are lost when performing the PassThrough filter and another method will have to be utilized to recover them.

Results If it is assumed that the person is located within a set space along all three axes, it is possible to separate the background from the foreground. However, if other objects are located within that region, they will also be part of the filtered image, which is a problem. The accuracy of the point cloud can also be relatively low due to using a resolution of 640x480 to represent a space of more than 5 metres in each direction. In addition to this, the IR projection originates from a single point, diffusing as the depth of the space increases. The result is that the differences in registered z-values keep increasing with the distance from the camera, making it harder to discern whether an object is part of the back- or foreground. Parts that should be foreground will therefore seem to be background and vice versa because they were on the threshold between the two depth levels. Strong ambient lighting can also make the depth values highly unreliable as it interferes with the IR projection.

Conclusion If the person is in an empty room with nothing at the same depth, this method could be enough. If any other object is in the same depth range as the person (including the walls), it will be included in the final image. For the whole body to be in the image, the person needs to stand far enough from the camera, which in turn reduces the accuracy of the depth values. The floor also needs to be taken into account and, as the elevation of the camera is unknown, it is very difficult to estimate how much of the y-axis will have to be removed without losing too much of the person. This method is therefore a good basis but not sufficient on its own.

4.1.3 Segmenting the surfaces in the point cloud

PCL has a set of functions to find groups of pixels that, when put together, form a surface. The following code can accomplish this:

// Cloud from Kinect, populated elsewhere
pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloudIn;
pcl::PointCloud<pcl::PointXYZRGB> cloudTemp;

pcl::ModelCoefficients::Ptr coefficients(new pcl::ModelCoefficients());
pcl::PointIndices::Ptr inliers(new pcl::PointIndices());

// Create the segmentation object
pcl::SACSegmentation<pcl::PointXYZRGB> seg;
// Optional
seg.setOptimizeCoefficients(true);
seg.setAxis(Eigen::Vector3f(0, 0, 1)); // Z-axis
seg.setEpsAngle(5.0 / 180 * 3.14);     // rad
// Mandatory
seg.setModelType(pcl::SACMODEL_PERPENDICULAR_PLANE);
seg.setMethodType(pcl::SAC_RANSAC);
seg.setMaxIterations(1000);
seg.setDistanceThreshold(0.1);

// Create the filtering object
pcl::ExtractIndices<pcl::PointXYZRGB> extract;

// Storage for the found surfaces
std::vector< pcl::PointCloud<pcl::PointXYZRGB>::Ptr > subClouds;
subClouds.clear();

// Number of points in the original cloud
int nr_points = (int) cloudIn->points.size();

// While 30% of the original cloud is still there
while (cloudIn->points.size() > 0.3 * nr_points) {
    // Segment the largest planar component from the remaining cloud
    seg.setInputCloud(cloudIn);
    seg.segment(*inliers, *coefficients);
    if (inliers->indices.size() == 0) {
        std::cerr << "Could not estimate a planar model for the given dataset." << std::endl;
        break;
    }
    // Extract the inliers to a new cloud
    extract.setInputCloud(cloudIn);
    extract.setIndices(inliers);
    extract.setNegative(false);
    extract.filter(cloudTemp);
    // Remove the inliers from the input cloud (reduces the size of the cloud)
    extract.setNegative(true);
    extract.filter(*cloudIn);

    subClouds.push_back(pcl::PointCloud<pcl::PointXYZRGB>::Ptr(
        new pcl::PointCloud<pcl::PointXYZRGB>(cloudTemp)));
}

The program uses the RANSAC [31] segmentation method to find corresponding points. setAxis, setEpsAngle and setDistanceThreshold set three parameters that can be adjusted to change the results. The axis is not mandatory but, as we are looking for planes that are perpendicular to the z-axis, it improves the results significantly. setEpsAngle allows a margin of error on that axis. This is important as the human shape is not flat, but when the body is facing forward, the surface is more or less perpendicular to the z-axis. setDistanceThreshold allows the programmer to set how spread out the points are permitted to be while still belonging to the same plane. For a more diverse scene, this value needs to be lower to prevent objects from being merged.

The program loops through the algorithm, each time finding and removing the largest plane until the number of points that remain in the cloud is below a threshold or until no more planes can be found. The detected planes are stored in a vector of separate point clouds. The points can then be inspected afterwards. A possibility would be to use the cloud containing the centre pixel to represent the person. It is important to note that these clouds are not dense! It is possible to cycle through the pcl::PointIndices object (inliers) and extract the 2D coordinates that correspond to the original point cloud separately. The following code will accomplish this:

// "floor" is the pcl::PointIndices object returned for the detected plane
for (int k = 0; k < floor.indices.size(); k++) {
    int x = floor.indices[k] % cloud.width;
    int y = floor.indices[k] / cloud.width;
    /* Do something with these coordinates */
}

This method can only be used with the first cloud that is created. Subsequent clouds are no longer dense and as such the positions of the points in the cloud no longer conform to the formula above. Put simply, it is only possible to extract the largest plane from a cloud if you want it to remain dense.

Figure 4.2: (a) Surface Segmentation (Close-up) (b) Surface Segmentation (Far)

Results Theoretically this algorithm could provide exactly what is needed. In practice, however, it is not optimal. The setAxis, setEpsAngle and setDistanceThreshold variables can be fine-tuned to produce acceptable results for a specific scene, but when applied to another scene, the quality of those results deteriorates. Figure 4.2a and figure 4.2b demonstrate this. Each colour is a different point cloud (and a different plane). Figure 4.2a is a close-up while 4.2b is from far away with a lot of extra content in the room. In each example the person is in the centre of the scene but the results for 4.2b are useless. Note that in both cases, the face is missing. This is because the curvature of that surface is too much for the setEpsAngle value.

Conclusion The results for this method are completely unsatisfactory. It could work for close-ups in combination with Grabcut to fill in the gaps, but a certain versatility is still needed that this method cannot provide. I did find that it is useful for finding the floor or the ceiling and have implemented this method in my final code exclusively for that purpose. Though the first priority in this thesis is results, it is very important to mention that this method is very processor-heavy. It takes a couple of seconds per surface found, making it completely unsuitable for online implementations.

4.1.4 Segmentation based on colour

During my research into background removal, another function was added to PCL that showed promise for accomplishing my goals. The class is pcl::SeededHueSegmentation. The principle is that, given an array of pixels that definitely belong to the target object, the algorithm will find pixels with a similar hue. Hue is part of the HSV (Hue-Saturation-Value) colour space shown in figure 4.3. Hue is the pure colour value without taking saturation or value (light intensity) into account. Most clothes have a very similar hue and so does skin. Most differences will be in intensity due to shadows, which is why only the hue is compared. If the pixels that are initially provided are representative of the various areas of the body and the background is sufficiently different in hue, the algorithm would be expected to find the whole person.

The following code demonstrates the use of the pcl::SeededHueSegmentation class:

// Cloud from Kinect, populated elsewhere
pcl::PointCloud<pcl::PointXYZRGB>::Ptr cloudIn;

pcl::search::KdTree<pcl::PointXYZRGB>::Ptr tree(
    new pcl::search::KdTree<pcl::PointXYZRGB>());

pcl::PointIndices base, found;
/*
 * base is filled with the initial pixels using the function
 * base.indices.push_back(y * cloudIn->width + x);
 */

// Run the algorithm
pcl::SeededHueSegmentation shs;
shs.setSearchMethod(tree);
shs.setClusterTolerance(0.1);
shs.setDeltaHue(0.1);
shs.setInputCloud(cloudIn);
shs.segment(base, found);

// Extract the points found to a new cloud
pcl::PointCloud<pcl::PointXYZRGB> cloudOut;
pcl::ExtractIndices<pcl::PointXYZRGB> extract;
extract.setInputCloud(cloudIn);
// Copy the indices into a shared pointer (pointing a Ptr at a stack
// variable would cause a double free when the smart pointer is destroyed)
pcl::PointIndices::Ptr foundPtr(new pcl::PointIndices(found));
extract.setIndices(foundPtr);
extract.setNegative(false);
extract.filter(cloudOut);

Figure 4.3: HSV Colour Space [26]

First, the variable base is filled with pixels of which it is certain that they are part of the person. For this I used a vertical line of about 200 pixels through the centre of the image (but also varied the length). I assumed that the line would run through the chest and part of the face and, with the examples that I used, I made sure that it did for the sake of the experiment. The two functions that can be fine-tuned to affect the results are setClusterTolerance and setDeltaHue. The first sets the relative distance allowed between the points in order to still belong to the same cluster. The latter is the accepted variation in hue. Increasing this will result in more points being accepted but may lower the accuracy.

After the algorithm has run its course, the points are extracted to a new point cloud. This new cloud is not dense. Therefore the same method as in section 4.1.3 can be used to manually extract the found pixel locations.

Figure 4.4: (a) Original Image (b) SeededHueSegmentation + Grabcut Mask (c) Original Image + Mask

Results Using a straight vertical line for the initial values is a safe way to include the person's chest and head (though depending on how far the person is standing from the camera, either the head may not be part of the line or part of the background may be included). It is, however, very difficult to predict where the person's arms will be. The requirement for this method to work is not to include each part of the body in the initial values but instead a sample from each colour area. Figure 4.4a shows an example of this. A line can be seen that covers the T-shirt, the top of the trousers and the face (highlighted for the sake of illustration). The trousers are roughly the same colour and the shirt continues through to the upper arms, but the bare arms are a completely different hue that is not part of the initial sample. Figure 4.4b shows the results from the SeededHueSegmentation function after the floor was removed using the method from section 4.1.3. The data was then fed into the Grabcut algorithm. An additional short horizontal line was drawn to include more samples. The black specks are the output from the SeededHueSegmentation. These are assigned the value GC_FGD. A square is drawn around these points with the value GC_PR_FGD because otherwise Grabcut barely returns anything. The resulting mask is applied to the original image and the result is shown in figure 4.4c.

Firstly, because of the need to add an area of GC_PR_FGD values, some of the background was included, which is not desirable. One arm was found by the segmentation while the other was not. This may be because of the shadow that the sleeve casts on one arm which is not present on the other. The rest of the result is quite precise, but it is mainly a product of Grabcut and not of SeededHueSegmentation.

Conclusion For background removal, this method is quite worthless. If the object were uniform in colour, it would most likely work quite nicely, but not for a target as varied in hue as a clothed human. Most of the work was done by Grabcut, which supports the hypothesis that the combination of Grabcut with PCL filtering could improve results.

It may be possible to use this method to map the area of the hands and face as a starting point for drawing the skeleton.

4.1.5 Segmentation through connected components

This method is a combination of some of the previous methods and some new concepts. It produces very good results and will consequently be used for the remainder of this thesis. Connected component analysis is a sub-category of graph theory. The idea is to apply a threshold to an image and to then label the pixels that touch along at least one of 4 or 8 directions. Each group gets a unique label, as shown in figure 4.5. A number of stages are required to guarantee the best results:

Assumptions

- The person is located roughly in the middle of the scene (the centre pixel is part of the person).

- The (small) centre area of the scene is part of the person's chest.

- The person's arms and legs are not behind the chest (but can be in front).

- The person is not touching any other objects.

- Other objects are sufficiently far away from the person so that they do not receive the same label when a binary mask is created (see below).

- The floor is the largest continuous horizontal plane in the scene.

Figure 4.5: Connected components

Filtering the cloud along the z-axis A binary mask requires that each pixel is labelled as either foreground or background (0 or 1). In traditional applications of connected components, this is done by applying a threshold on the grey-scale values. To fully utilize what is available, the threshold is instead applied to the depth values in the point cloud. The average depth of the centre area is calculated. An average is taken in case the centre pixel is an anomaly or has a NaN (not a number) value due to interference with the IR projection. This depth is now a representation of the depth of the chest. A threshold is then applied at a depth of roughly 30 cm behind that point. This should encompass the entirety of a person facing the camera. Anything behind that depth is background and anything before it is foreground. A cv::Mat binary mask is then created using this data.

As stated in the assumptions, the arms and legs must stay next to or in front of the chest (which is reasonable when interacting with a robot). The assumptions are very important for this step to produce usable results.

Removing the floor In a typical scene, the floor will be present in both the fore- and background. It can be assumed that the floor is the largest horizontal plane in the scene. When the segmentation method from section 4.1.3 is used for only one loop and the axis is changed to the y-axis, the largest horizontal plane will be returned, which will be representative of the floor. The found pixels are then marked as background in the mask. A few issues arise with this. Because the camera is not certain to be perfectly horizontal, the virtual y-axis may not align with the real-world one. For this reason, a margin of error must be permitted in the setEpsAngle value. The consequence of this is that some objects are removed along with the floor. A significant example of these are the feet. While this excludes part of the person in the final filtered image, it will most likely make the creation of the skeleton easier as the orientation of the feet is no longer something that needs to be accounted for. The significance of the feet in a skeleton is minor and as such this can be seen as a useful side effect rather than an error.

This step is definitely the bottleneck of the algorithm. While the others take fractions of a second, finding a single plane in the complete cloud can take up to 8 seconds. To reduce this time, the points that lie beyond the depth threshold are removed. To keep the cloud dense, this is done by manually setting the depth value for those points to NaN (Not a Number), which makes them irrelevant to the PCL functions. Because most of an image is contained in the background, the number of points that need to be processed has now been significantly reduced. As a result, the entire background removal procedure takes 200 to 800 ms, depending on how close the person is to the camera. This reduces the processing time by at least a factor of 10.

Labelling the connected components OpenCV provides a function to find the connected components and return a list of their outlines. With a separate function these outlines can then be filled. OpenCV does not currently contain a more direct method. With the transition from OpenCV 1.0 to version 2.0, a lot of the old functions and object types are still backwards compatible, but functions require the types for their specific versions. The cv::Mat type is part of the new 2.0 set while the labelling functions are part of the older version. The old labelling functions have been converted to the C++ interface but they contain bugs that are yet to be resolved. Though this barely has an effect on the speed of the process, using the C++ version once it has been fixed could make it run just a little faster.

Once all the connected components are labelled, the algorithm determines which group contains the centre pixel (or one in the vicinity when the centre depth is NaN). Each pixel in that group is then marked as foreground in a new mask with the rest as background. At this point the person has been separated from the background. All that is needed is to apply the mask to the original image.

Dilation and erosion (optional) Figure 4.6a shows the result from the algorithm up to this point. It is the most representative outline of the person but, because it will need to be used to generate a skeleton, it may not be the most workable. The rough edges and strange shape of the head due to interference with the IR light will make it very difficult to create a clean, smooth skeleton. Applying a combination of dilation and erosion to the mask will produce a much more desirable mask for later steps. Dilation will grow the foreground pixels in the mask and erosion will shrink them. Figure 4.6b shows the improvements that this can provide. The remnants from the feet are removed and the head is more uniform. The edges are also straighter and less irregular. The order in which dilation and erosion are applied is crucial and will need to be experimented with. In extreme cases the legs could, for example, become one entity. See the function Filters::createFGMask in the code attached to this thesis for the final sequence used.

Results Figure 4.6c shows the result from the mask in figure 4.6b. The source image is the same as in figure 4.4a. The result is very good for what it will be used for. If the mask were used for an actual visual representation of the person, a lot more accuracy and smoothing of the edges would be needed, but for estimating a skeleton, it is more than sufficient. The legs have a clear beginning (unlike in figure 4.6a) and though the head has a strange shape on the top, it is mostly symmetrical and the sudden change in shape can be anticipated. Due to the eroding and dilating, a small part of the background has been added to the edges and some other thin edges have been removed. This is also a side effect of the method used by Kinect to calculate depth. The strange shape of the head can also be attributed to this because my dark hair and gel interfere with the IR-light reflections.


Figure 4.6: (a) Untouched mask (b) Mask after Erosion and Dilation (c) Final Image (with mask applied)

Conclusion This method is by far superior to the other methods listed above. It is a combination of depth thresholding and point cloud surface segmentation, along with connected component labelling. It uses their advantages and removes most of the weaknesses. If the assumptions are met (which should be reasonable), there should be no interference from the environment. The edges are sharp and contain a minimum of the background. Using erosion and dilation, the edges can be smoothed and unwanted protrusions removed. The main downside of this method is the added time needed for finding the floor. The basis for the algorithm is, however, present and will produce sufficiently adequate results for the rest of the thesis.

4.2 Combination with Grabcut

Grabcut seemed to be a promising method for background removal but it proved to be insufficiently reliable. It can still be used to supplement the chosen method, but the added processing time needed to perform this step will be excessive if the background removal is time-critical.

Most of the depth values at the edge of the person actually belong to the background. This is caused by a faulty recognition of the IR projection at the edge of the body. A few iterations of Grabcut with the background set as definite background and the foreground as probable foreground could remove these unwanted pixels. Beyond this, the use of Grabcut for the purposes of this thesis is limited. Even if it were to work admirably, it would still be mostly counterproductive as dilation and erosion are applied to smooth out the edges of the mask.

4.3 Summary

This chapter covered the various methods that were experimented with to produce an algorithm for background removal. Plain filtering along the z-axis in PCL leaves too much of the surroundings in the image.


Segmenting the point cloud based on continuous planes perpendicular to the z-axis produces mixed results that are too unpredictable to be used for background removal, but it can be used to find the floor in a scene, albeit at a high cost in processing time. Segmentation based on the hue of the pixels is completely unsuitable for the purposes of this thesis. In combination with Grabcut, some very rough results can be attained but they are far from acceptable. The method that will be used for the remainder of this thesis is based on analysing connected components. First, a binary mask is created by thresholding the depth values of the point cloud. Then, after removing the floor, each component in the mask is labelled to separate the connected components. The one that contains the centre pixel is isolated and, after some optional dilation and erosion, the final mask is obtained. Grabcut can then be used to trim the false edges that still belong to the background, but can be left out if the application is time-critical.

Note that the test image for the last two methods contained a blue background. This was done to give the algorithms that are based on colour differentiation, such as Grabcut, a greater fighting chance. If they worked with that background, they could then be expanded to work with more random backgrounds too. The final method is not affected by colour, only depth, and therefore the colour of the background becomes irrelevant.


Chapter 5

Generating the Skeleton

This chapter will cover the algorithms developed for the creation of the 2D and 3D skeletons. The 3D skeleton will be based on the 2D version by projecting it into the 3D space.

Many images in this section have been inverted (black on white instead of white on black) to make the effects more visible and make printing them more environmentally friendly.

Section 5.1 details the development of the 2D skeleton and section 5.2 shows how it was then converted to 3D.

5.1 Two-dimensional skeleton algorithm

As touched upon in the literature study, the definition of a pseudo-skeleton that I will use in this thesis is a set of lines that run equidistant from the edges they run parallel to. It is referred to in the literature as a topological skeleton and can best be compared to a stick-man (without the circle for a head). This is a basic representation of a human skeleton but will suffice for many practical applications where no more than the posture of the person is needed. Throughout this section, the term "zero-pixel" will be used to represent a pixel that lies outside the body and thus has a value of 0. A distance transform calculates the distance to the closest of these zero-pixels for each individual pixel.

Figure 5.1a shows an example of the results this algorithm produces. It is based on the image in figure 5.1b, from which the background was first removed using the algorithm in section 4.1.5.

5.1.1 Required assumptions

Given the assumptions in section 4.1.5, one extra assumption is vital to creating a 2D skeleton: the arms and legs must be clearly differentiated from the torso and from each other. This means they are not allowed to touch. If they were to touch, it would seem as if they were combined into one body part and the skeleton would run down the centre of that one mass instead of multiple lines for each of the parts.

5.1.2 Starting point & general strategy

Removing the background provides a binary mask that shows the outline of the person. When a distance transform is applied to the mask, the desired skeleton becomes immediately visible to the human eye. This can be seen in figure 5.2a. The highlights that run down the centre of each body part are exactly what is needed. Unfortunately, a computer does not see this line. I tried applying a watershed algorithm to the image, hoping it would result in the skeletal lines, but due to the gradations between the lines that run perpendicular to the skeleton, the result is completely useless. A simple threshold was not possible either due to the grey-level gradient on the skeletal lines themselves. This meant that a complete algorithm was needed to steer the program along what a human could perceive but the computer could not.

Figure 5.1: (a) 2D skeleton (b) Test image for skeleton creation

The first step is to find a starting point. The highest value on the distance transform will be at the intersection of the arms and the chest, as indicated with the circle on figure 5.2a. The reason for this is that at that point, the shortest path to a zero-pixel is diagonally to the shoulders or the armpits. This point will be referred to as the centerPoint. From this point, all six body parts can be found: the head, 2 arms, the torso and 2 legs, of which 4 can immediately be found by going up, down, left and right. To locate the legs, the torso must be followed until a branch is detected (see section 5.1.4). Once the initial direction has been determined for a body part, the algorithm merely has to follow the line until it detects the end of that part (see section 5.1.5).

5.1.3 Following the skeleton

The most useful property of the skeleton that is visible on the distance transform is that the pixels on either side of the line have a lower value than the line itself. A second useful property is that sequentially following the neighbouring pixel with the lowest value will result in the shortest path to the nearest zero-pixel. On a distance transform, two such paths will intersect at a point which is located on the skeleton. Using these 2 properties, the following sequence can be used to find a skeleton point (illustrated in figure 5.3):

1. The algorithm starts with a pixel that lies close to the skeletal line, on one side of the skeleton or the other. We'll call this point estPixel. We are also given a vector that points in the direction of where the next estimated skeletal point lies, called the nextVector. Initially, this is provided by the up, down, left and right directions, starting at the centerPoint. With each iteration, the nextVector is recalculated.

Figure 5.2: (a) Result of a distance transform (inverted) (b) Following a vector (blue) through a method that utilises steps (green) instead of full Cartesian coordinates (red)

2. Starting at estPixel, repeatedly follow the neighbouring pixel with the lowest value. As pointed out before, this leads to the nearest zero-pixel. We'll call this point zeroPixel1.

3. Follow the vector between estPixel and zeroPixel1 in the opposite direction, starting at estPixel, until a peak is reached in the pixel values. This peak is a point on the skeleton. We'll call it skelPixel.

4. Next, a pixel needs to be found that lies on the other side of the skeletal line. It is very important that this pixel does not lie on the same side as zeroPixel1. To achieve this, a few pixels are followed along a line perpendicular to the nextVector, which provides the shortest path to the other side. From this point, the shortest path to the nearest zero-pixel is found using the same method as before. This point is zeroPixel2.

5. A vector perpendicular to the line connecting zeroPixel1 and zeroPixel2, with the same general direction as the nextVector, is calculated. This becomes the new nextVector. Starting at skelPixel, a vector with a length of a few pixels is followed along the new nextVector. The resulting point becomes the new estPixel.

This algorithm is iterated until the end of the body part is detected by a different algorithm, and this is repeated for each body part. Due to the many gradients in the distance transform, an extra check is implemented to ensure that the skeletal line does not suddenly jump onto a line perpendicular to the skeleton and start following that instead. The check calculates the angle between the previous nextVector and the new one. If the angle is greater than 30°, the previous nextVector is used in combination with the new estPixel. I observed from practical tests that this anomaly then corrects itself within the next few skeleton points. Using an angle of 30° still allows the body to bend at places such as the arms, as the shape does not contain any sharp corners.

Figure 5.3: Procedure for following the 2D skeleton

Following a line along a vector: the method described above often requires the algorithm to follow a vector for a distance that differs from the vector's own length. Because images are based on Cartesian rather than polar coordinates, it is not possible to simply draw a line of a specific length along a given direction. Because of the discrete nature of pixels, it is also rarely possible to draw a line of an exact length along a given vector. The following method approximates the required lines as closely as possible:

• When a grid is superimposed on a vector, it becomes clear that the vector is a combination of horizontal and vertical steps. This combination can be approached in two ways: either all the horizontal steps are taken, followed by all the vertical steps (or vice versa), or one step is taken in the shorter direction followed by a proportional number of steps in the other. This is then repeated until the full length has been reached.

• In order to follow a vector for a distance other than its own Euclidean length, the second method must be used. This ensures that the difference between the approximated and the original vector remains as small as possible. Figure 5.2b shows the difference between these approaches: the blue line is the vector, the red line shows the first method and the green steps demonstrate the method used in this thesis. The traced line can be stopped at any point along the vector or its extension, provided that point has integer coordinates.

This algorithm has been given its own class in the file vector_tracer.cpp so that it can easily be called from any other class.

5.1.4 Detecting the legs

At a certain point while following the torso, the skeletal line will branch into the two legs. This branch must be detected in some way. During development, I created two methods to achieve this. The first worked as follows:



Figure 5.4: (a) Cross-section before and after the branching of the legs (b) Distance transform values above the branch (c) Distance transform values below the branch

• A cross-section of the distance transform perpendicular to the spine has a single maximum: the spine itself. Figure 5.4b demonstrates this.

• Once the cross-section reaches the legs, there are two maxima (one for each leg), as shown in figure 5.4c. Detecting this change provides the starting point for the legs.

Unfortunately, this method was insufficiently reliable. A distance transform is made up of a hierarchy of lines: the main line that represents the skeleton and a set of parallel lines that run perpendicular to it. This secondary set of lines made the method too unpredictable because unless the cross-section runs exactly along those secondary lines, multiple maxima will be present. In some cases it worked perfectly, while in others it produced a false positive or missed the legs entirely.

To solve this, I created a completely different algorithm for detecting the legs. The following steps describe the process:

• The general shape of the human body has an empty space between the legs, almost directly below the spine. Let's call this point betweenLegs (shown as a blue circle in figure 5.5a). Due to the nature of a distance transform, the distance from the branching point to betweenLegs equals the horizontal distance from that point to the edge of the hips. If the distance between skelPixel and the closest zeroPixel is dist, then an area at a vertical displacement dist below skelPixel can be used to check for the branch. If part of that area consists of zero-pixels, the branch has been detected and betweenLegs has been found.

• As can be seen in figure 5.5a, each leg consists of two main parts: the line located at the height of the groin and the steeper line that makes up the leg itself. The point where the two meet is theoretically at the same height as betweenLegs. The first line can therefore be approximated by finding this point. Let's call this line the legVector.

• If a horizontal cross-section is made through the legs at the height of betweenLegs, the centre of each leg can be found. The lines between these centres and skelPixel are the legVectors. These vectors can be used as the nextVector to steer the skeleton in the direction of each respective leg. As figure 5.5a suggests, these are approximate vectors, but they are definitely sufficient for the purpose they serve.

• estPixel is set to a point that lies a few pixels along the legVector. This way the skeleton algorithm is pushed in the right direction.



Figure 5.5: (a) The two stages of the legs (b) Branch in the distance transform at the top of the head (c) False end detection when using a single reference point

• When estPixel and nextVector have been set, the skeleton algorithm is continued. This is repeated for each leg with its respective values.

I found that this method worked on every sample I tested, provided that the empty space between the legs was actually present vertically below the branch point.

5.1.5 Detecting the end of a body part

A method is needed to detect the respective ends of the head, arms and legs. The lines could be allowed to run until they hit a zero-pixel, but that would result in kinks in the skeleton due to the branch visible on the distance transform at the end of each part. Figure 5.5b shows an enlargement of the head from figure 5.2a to demonstrate this phenomenon. Instead, the skeleton needs to end in the middle of the head, at the bottom of the legs (because the feet have been cut off by the background removal) and in the middle of the hands.

It would seem that the same method used for detecting the start of the legs could be utilised here, but implementation showed that it is too sensitive for a body part as thin as the arms. Not only is the margin of error much smaller, but if the arm is bent, a false positive can occur because zero-pixels are encountered when the line from before the curve is extended. Figure 5.5c demonstrates this.

The method illustrated in figure 5.6 is used in the final code (the percentages are rough estimates). In some cases it stops a little too soon, but it generally provides good results. It is based on the following principle:

• If a circle is drawn around the current skelPixel with a radius greater than the distance between skelPixel and one of the zeroPixels, part of the circle will lie outside the body.

• If the skeleton has not yet reached its end, the body will enter one side of the circle and leave through the other.

• If the skeleton has reached its end, the body will enter the circle on one side but not exit on the other. This means that when the end has been reached, more of the circle lies outside the body.

• By applying a threshold to the number of circle pixels that lie outside the body, one can determine whether or not the end has been reached.


Figure 5.6: Detecting the end of a body part. The part of the circle outside the body is shown as a percentage.

This method is also resistant to bends in the body because if fewer points are present on one side, more will appear on the other. In practice, either the threshold or the radius of the circle needs to be adjusted slightly for the intended body part.

5.2 Three-dimensional skeleton algorithm

The inaccuracy of the point clouds delivered by the Kinect makes the direct creation of a 3D skeleton difficult. The problem is that the points are arranged in layers rather than progressing smoothly in depth. Additionally, the 3D scene has only one perspective: if an arm is placed in front of the rest of the body, anything behind the arm will be missing, forcing those points to be extrapolated from the ones that remain visible. For this reason it was decided to first create a 2D version, in which the arms and legs can be clearly differentiated, and to construct the 3D version from that.

5.2.1 Projecting from 2D to 3D

Once the 2D skeleton has been found, it is easy to "draw" it onto the 3D point cloud as long as the cloud is still dense and the u,v-coordinates are therefore still available. Here, u and v represent the x,y coordinates in a 2D image as opposed to the x,y,z coordinates in a 3D scene. To build the skeleton in a point cloud, all points that are not part of it are given a depth value of NaN (not a number), while those that are receive a depth representative of the skeleton at that location.

Ideally, a skeleton is located in the centre of the body, and the skeleton created in this thesis is supposed to represent that centre line exactly. Unfortunately, the point cloud provided by the Kinect does not allow for this accuracy. Firstly, the perspective stops halfway due to the limitation of the IR pattern: the light cannot reach behind the person, so a complete scene is not available. Secondly, as mentioned before, the quantisation of the depth values causes the cloud to be made up of layers that grow further apart as the depth increases. If the skeleton is based on these values, it too will be composed of layers. Additionally, due to irregularities in the shape of the body (for example, folds in the clothing), the depth values may jump back and forth. The chosen method compensates for these limitations, producing a skeleton that is much smoother but does not lie exactly in the centre of the body:

• Each skelPixel in the 2D skeleton has its nextVector. Following the line perpendicular to this nextVector in both directions to the edge of the body produces two depth values. It is assumed that when the person is facing the camera, these points are in line with the centre of the body. This is a reasonable assumption, given the curved nature of the human body. If the person is standing slightly sideways, one side will have a greater visible depth while the other has an equally smaller one, so the mean of the two stays the same regardless of the relative positioning. The mean of these two values is used as the depth for that skelPixel.

Figure 5.7: (a) 3D skeleton (original) (b) 3D skeleton (after averaging filter)

• To compensate for the irregularities in the depth values, a large averaging filter is applied. Experiments showed that a small mask is insufficient and that a size of 15 points provides good results. The mask is also extended to the extremities of the skeleton, where the average is taken over only those values that are in range. Figure 5.7a shows the skeleton before filtering and figure 5.7b the result after applying the filter.

An unfortunate side effect of the filtering is that the separate body parts do not meet at the same depth. This could be resolved by continuing the mask across the boundaries of neighbouring body parts, which are stored in separate vectors. Due to time restrictions, I chose not to implement this.


Chapter 6

Testing

This chapter covers the process used to test the algorithms and the results those tests produced. Section 6.1 explains which aspects need testing and why. After that, the testing procedure is laid out, followed by the major issues that were encountered and how they were solved. Lastly, the actual test results are presented and evaluated.

6.1 What needs to be tested

There are three main aspects of this thesis that warrant testing: background removal, 2D skeleton creation and 3D skeleton creation. As explained in section 1.3, the true purpose of the skeleton was only made known near the end of the thesis due to its confidential nature. This shifted the importance of the aforementioned aspects. Background removal can be eliminated by using the results of the segmentation algorithm. The 3D skeleton depends on the 2D skeleton and the capabilities of the Kinect rather than on the algorithm behind it; testing it will therefore produce roughly the same quality of results each time.

The most important algorithm to test is the creation of the 2D skeleton. Within this algorithm, the key questions are:

• Is the skeleton traced to the end for the head, the torso, each arm and each leg?

• Is the transition from torso to legs detected?

• Does the algorithm that detects the end of a body part work as expected? This is of lesser importance, as the transitions between the various body segments are the end goal. Because of this, it does not really matter whether, for example, the skeleton stops in the middle of the hands or at the fingertips. These results only become important when the skeleton is used to actually represent the person.

6.2 How testing was performed

Jorn Wijkmans (another thesis student at the university) created a collection of point clouds for his own thesis, some of which could be used to test my algorithms. The clouds contain several people of both genders in a variety of poses. Of those available, the following poses were compatible with the assumptions in section 4.1.5:



Figure 6.1: (a) T-Pose (b) Y-Pose (c) I-Pose

• T-pose: arms held horizontally. Shown in figure 6.1a.

• Y-pose (or W-pose): arms pointing up with the upper arms held horizontally (angled slightly down for the W-pose). Shown in figure 6.1b.

• I-pose: the arms hang vertically next to the torso with a very small space in between. This pose borders on contradicting the assumptions, but I decided to include some of these to test the effectiveness of GrabCut and the algorithm's resilience to unexpected body shapes. Shown in figure 6.1c.

Each pose also required the legs to be distinguishable from one another. After sorting through the available files, a selection of 21 point clouds was made: 11 male and 10 female. Each gender comprises 4 different people with a variety of body builds. Each person was asked to wear a tight jacket to reduce the random effect of loose clothing. In the point clouds of myself, which I had used to develop the algorithm, I was wearing loose clothing. The tight jacket puts additional strain on the algorithm because the natural curvature of the human body now plays a part as well.

Because the samples were not created by me, a different set of conditions was used to create them. This meant that two things needed to be resolved before testing could start:

• The points in the clouds were of the type pcl::PointXYZRGBA. These contain alpha values and needed to be converted to pcl::PointXYZRGB because my code is entirely based on the non-alpha version and the viewer cannot display these points along with their respective RGB values.

• Each person was asked to stand on a platform. Given that my background removal algorithm uses the largest horizontal plane as the floor, it is useless on these samples. Therefore, a different method was needed to remove the background.

To resolve these issues, I wrote a new program that converts the cloud to the needed type and removes the background using the arrow keys, PgUp and PgDown, along with visual feedback, to apply x, y and z thresholds. The resulting cloud was then exported to a new .pcd file.

To test these samples, I then wrote a separate controller program that skips the background removal, generates the skeletons, displays them along with the time taken and saves a snapshot JPG to the hard drive for later reference. This also demonstrates the modular nature of the algorithm: if only certain parts are needed, the rest is not forced upon the programmer.


6.3 Issues encountered and resolved during testing

During the testing process, a few issues came to light that prevented the algorithm from working consistently.

First issue: the first of these was a result of the tight jacket. For female or slightly overweight subjects, the highest distance transform value was not located in the centre of the chest, as expected; instead, the skeleton was built starting at the hips. This worked fine for the head, but once the part for the arms began, the program got stuck (infinite loop), as there are no horizontal skeleton parts near the hips. I resolved this by measuring the distance from the head to the toes (ignoring the orientation of the arms): if the starting point was not in the top third, the highest value in the top third was used instead. This solution worked on every sample. Figures 6.2a and 6.2b demonstrate this issue and its solution.

Second issue: due to the inconsistent nature of the background mask, the algorithm would sometimes get stuck. This could be observed when the last 1 to 3 points of the skeleton were repeated infinitely, and it always coincided with a specific set of conditions:

• The skeleton had strayed off its intended path, for example when, while tracing an arm, it suddenly follows a path perpendicular to the skeleton because of an anomaly.

• After the legs have been detected, the algorithm cannot find the proper path along the leg (see below).

I created a function that detects this and takes action to allow the algorithm to continue. Usually this means ending that body part and moving on to the next. For the torso, the algorithm is stopped because the legs cannot be found. For the legs, a different method was used that still allowed them to be created.

Third issue: sometimes, as mentioned before, the legs could not be traced. The transition from torso to legs was identified, but when trying to trace the legs along the given vectors, the path was not sufficiently discernible. This is caused by the size of the person in the image (in the samples, the person was only about two thirds of the height of the image) or by the shape of the hips. The algorithm already skipped a few skeleton points along the given vector to push itself in the right direction, and while this works, permanently increasing the number of skipped points to guarantee success would be wasteful for the legs that do not need it. To resolve this, I used the function from the second issue to temporarily increase the jump for that particular leg until it no longer got stuck. This increases the running time slightly but yields better results for the legs that do not need it. This method allows the algorithm to fix itself.

Fourth issue: the fourth issue was that the skeleton sometimes reached the edge of the mask, either because it left the expected path or because the function that detects the end of the body parts failed. When this happens, the algorithm makes unexpected jumps across the image, resulting in either an infinite loop or a segmentation fault because it tries to access pixels that are not part of the image. Either way, it adds points to the skeleton that do not belong and causes the program to stop working. To prevent this, the algorithm checks on each iteration whether it is about to hit the edge of the mask. If it is, the same action is taken as if it were stuck.

Final issue: in some rare cases, the skeleton will suddenly leave the path. This is unpredictable, and preventing it is not always possible. As an added countermeasure, each new vector to the next point is compared to the vector to the nearest zeroPixel. The ideal angle when following a straight line is 90°. If the new vector deviates too much from this angle, it is corrected. The smooth shape of the distance transform does not contain any sudden changes in angle, and a threshold was chosen that has been shown not to cause any additional problems. This method reduces the occurrence of this phenomenon but does not fully prevent it.

Figure 6.2: (a) False starting point (b) Corrected starting point

Once these issues were resolved, the final tests could begin.

6.4 Test results

The results of the tests can be found in the appendix. For each file, they show whether the skeleton reached the expected end for the head, arms and legs. The tests were run once with end detection and once without, to verify that the skeleton is traced to the extreme ends. Additionally, the time taken for the creation of the 2D and 3D skeletons was noted separately, along with the number of times the algorithm got stuck and had to correct itself. The expectation is that the time for the 2D skeleton increases with the number of times it gets stuck. If a body part did not finish without end detection, the results with end detection become irrelevant.

6.4.1 Results

The bottom of the table shows averages of the measured data. The I-poses are ignored here because they were not expected to deliver reliable results in the first place.

Without end detection: the results without end detection are very good. The head and the legs are always fully traced, though some legs got stuck and had to be restarted and adjusted repeatedly before they finished. The arms fared reasonably, with success rates of 56% and 63% for the right and left arm respectively. Every T-pose completed both arms, while the Y- and W-poses stopped right after the elbow, with one of the arms occasionally reaching its goal. The important thing to note is that with these results, both the shoulders and the elbows can be located; only the wrists cannot be found in a Y-pose.


The cause lies partially with the size of the person in the image. In the test clouds of myself used during development, I covered a larger vertical portion of the image. The arms are already the narrowest part of the body, and when the person is represented with fewer pixels this only makes matters worse. Earlier tests during development also showed that when the arms curve downwards, they are fully traced. This means the relatively sharp angle at the elbows in the Y-poses was also a factor.

With end detection: the results with end detection are less desirable (not a single pose finished completely). The method used to detect the end of a body part depends heavily on the given mask, and the results vary wildly with the situation. The values used internally by the algorithm can be tweaked for a given scenario to produce excellent results, but when applied to another scenario they may fall apart. When the detection failed, it either stopped too soon due to anomalies in the width of the body part or did not stop in time and was ended by the edge of the mask. The head poses a specific problem for this algorithm: the trace starts off wide in the centre of the torso, suddenly becomes very narrow at the neck, and then becomes much wider again when the actual head starts. If the end detection is calibrated to the actual head, the neck ends the skeleton prematurely, but if the sensitivity is turned down, the algorithm does not detect the end of the head in time. Ultimately, as mentioned before, these results are only important when the skeleton is used to represent the person. When used to supplement the segmentation algorithm from section 1.3.1, the end detection can be disabled and will not be missed. Oddly enough, the left leg seemed to have a much higher success rate (44%) than the right leg (13%), but that is pure coincidence.

Time taken and stuck occurrence: the average time taken is 366 ms for the 2D skeleton and 30 ms for the 3D skeleton. The 3D skeleton time does not vary much because it is based purely on the number of points in the skeleton. The 2D skeleton times stay around 320 to 400 ms. Three anomalies were excluded from the average (501, 745 and 1400 ms); these were caused by the high number of times the algorithm got stuck and had to restart. Ignoring these anomalies, the average number of times the algorithm got stuck is 1.3, which is more than acceptable because getting stuck merely means the algorithm has to adjust itself and redo at most 3 skeleton points.

I-poses: the I-poses provided mixed results. In each case, the head and legs were detected perfectly, though the starting point was often too low, making it seem as if the person has a very long neck. Sometimes an arm was found when there was enough space between the arm and the torso, but the arms are not reliable enough to be used. The GrabCut algorithm (in the function Filters::trimFalseEdges) was applied to the mask each time to remove the parts of the background that always remain at the edges and, in this case, between the arm and the torso. The entire foreground part of the mask was set to probable foreground and the rest to definite background. GrabCut removed some of the excess pixels, but not enough to make the I-poses useful for the arms. If time is a factor, GrabCut can be excluded from the algorithm, but when left in, it can apply some finishing touches to the foreground mask.

Background removal: the original background removal algorithm from section 4.1.5 was not needed for the main tests, but a few tests in a range of environments showed that it works as expected, with a processing time between 200 and 700 ms. A higher number of points located behind the person results in a faster time. The algorithm works as long as the centre pixel belongs to the person, the person is standing on the largest horizontal surface (the floor) and the Kinect is positioned parallel to the floor. The advantage of this method is that it is based purely on the IR projection: other background removal methods rely on the RGB values of the image, but this method works equally well regardless of the lighting conditions and the colour of the background.


Figure 6.3: (a) Complete skeleton (no end-detection) (b) Incomplete skeleton due to failed end-detection (points skipped in leg when stuck) (c) Incomplete skeleton due to bent arm (no end-detection)

Head    Right arm    Left arm    Right leg    Left leg
100%    56%          63%         100%         100%
(a)

Head    Right arm    Left arm    Right leg    Left leg
19%     6%           6%          13%          44%
(b)

Table 6.1: Average success rates (a) without end-detection (b) with end-detection

3D skeleton generation: the 3D skeleton was generated consistently each time the 2D skeleton had been built. As mentioned earlier, the torso is located at a greater depth than the arms due to the inconsistency of the depth values at the edges of the body. Regardless, the result was always as expected, and as long as the 2D version is a correct representation of the person, little can go wrong here.

Examples: figure 6.3a shows a complete skeleton created without end-detection; this is the ideal result. Figure 6.3b was complete without end-detection, but with it turned on, the trace stops prematurely. It is also an example of the right leg repeatedly getting stuck and quite a few points needing to be skipped. This was an extreme case, but it demonstrates the resilience of the "stuck prevention" method. Figure 6.3c succeeded everywhere except on the right arm. As mentioned before, the skeleton goes off track after the elbow, leaving it still useful for everything but the right wrist.

Tables 6.1a, 6.1b, 6.2a and 6.2b show a concise version of the test results in the appendix.

(a)

2D skeleton    3D skeleton
366 ms         30 ms

(b)

Nr of stuck warnings
1.3 (min: 0 / max: 12)

Table 6.2: (a) Average time taken for the creation of the 2D and 3D skeletons (b) Average number of stuck warnings that were subsequently corrected (along with min/max values)


6.5 Conclusions

The tests showed that, after the issues in section 6.3 were resolved, the skeletal tracking algorithm works very well. When the arms are sufficiently bent, the skeleton may stop following its intended path, but this always happens after the elbows. This means that the wrists are the only joints that cannot always be located. The I-poses demonstrated that the assumptions must be met: each of the arms and legs must be differentiated from the torso and from one another. The algorithm can, however, still be used to track anything but the arms in such poses.

Detecting the end of a body part was not as successful. When properly calibrated to one person standing at a constant distance, the method works, but this defeats the purpose of creating a universal algorithm. If the end is not detected, the edge of the body is always detected, so the low success rate will not cause the program to crash, but it may end prematurely, leaving out a part of the skeleton.

The additional safety check ensures that the algorithm will (almost) never get stuck in an infinite loop or encounter a segmentation error. The check detects a continuous repetition of up to 3 different values. The average time taken to create both skeletons lies just under half a second; with optimization this could be improved. The 3D skeleton is based on the depth values from the Kinect. If a point cloud could be generated that shows the body from all sides, a more accurate 3D skeleton could be created, but with what is given, the result is adequate.
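The repetition check can be sketched as follows. This is a minimal illustration, not the thesis code: the pattern length of up to 3 distinct values comes from the description above, while the function name `is_stuck` and the threshold of three consecutive repeats are assumptions.

```python
def is_stuck(points, max_period=3, repeats=3):
    """Return True when the tail of `points` cycles through a short
    repeating pattern of at most `max_period` distinct values,
    repeated `repeats` times in a row (i.e. the tracer is bouncing
    between the same few pixels)."""
    for period in range(1, max_period + 1):
        needed = period * repeats
        if len(points) < needed:
            continue
        tail = points[-needed:]
        pattern = tail[:period]
        # The tail is stuck if it is just `pattern` repeated end to end.
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return True
    return False
```

When such a repetition is detected, the tracer can issue a stuck warning and skip ahead a few points instead of looping forever, which matches the behaviour reported in the tests.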


Chapter 7

Future work

This chapter explores possible future improvements and extensions to the results of this thesis. Section 7.1 postulates what could be done with a more reliable 3D model of the person. Section 7.2 presents some suggestions for improving the success rate when tracing the arms, and section 7.3 details the intended use of the skeletal creation algorithm in the research by Koen Buys et al. [2].

7.1 Improving upon the 3D skeleton

The point clouds for this thesis were created using a Kinect but, as was mentioned at the beginning of this text, the code was written to be independent of the hardware. If it were possible to create a 3D, non-hollow, voxel-based version of the person, the distance transform could be applied in 3D. The basic principle of the skeletal generation would remain the same, but it would now be possible to follow an arm or a leg located in front of or behind the person (as long as it does not touch other body parts). Whereas in 2D the arm would seem to be part of the torso, in 3D it would be a separate entity. This would also allow the person to be facing in any direction. A fully 3D version of the person would also allow the depth values of the final 3D skeleton to be significantly more accurate.

7.2 Keeping the 2D algorithm on track when tracing an arm

The only body parts that the 2D algorithm seemed to have trouble with were the arms. This is because the arms are the thinnest of the body parts. If the person does not stand close enough to the camera, the width of the arms decreases even further. The problem is not actually the physical width of the arms but the number of pixels that represent it. A greater number of pixels provides a more gradual change in values for the distance transform, resulting in a more reliable path traced by the skeleton. Experiments with bent legs have shown that these do not have this problem, as they are much wider.

One solution would be to magnify the mask and then perform a distance transform on it. Doing this will increase the accuracy of the distance transform as it has more pixels to work with. It would be like taking a picture with a better camera, rather than enlarging the picture itself and interpolating the pixels. The edges of the mask will become slightly blurred, but this is of little to no consequence. A zero-order magnification can be used to reduce processing time as it does not have to waste resources on calculating


bilinear interpolations. Afterwards, the skeleton points can be scaled back down to the original size. The skeleton will not have strayed from its intended path and the accuracy of the points will have been retained.
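This proposed step can be sketched as follows, assuming a NumPy 0/1 mask. The function names are hypothetical, and the distance transform itself (e.g. OpenCV's `distanceTransform`) is left out; only the zero-order magnification and the scale-back of the traced points are shown.

```python
import numpy as np

def magnify_mask(mask, factor):
    """Zero-order (nearest-neighbour) magnification: each mask pixel
    is simply repeated `factor` times along both axes, so no
    interpolated grey values are introduced at the mask edges."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

def scale_points_down(points, factor):
    """Map skeleton points traced in the magnified image back to the
    original resolution, as (row, column) pairs."""
    return [(y // factor, x // factor) for y, x in points]
```

A distance transform would then be run on the magnified mask before tracing, and the resulting path fed through `scale_points_down`.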

7.3 Combining with the segmentation algorithm

The true purpose of the skeletal generation algorithm was described in section 1.3. It could, however, be used in at least two ways. One is to generate the entire skeleton and superimpose it upon the body segmentation. The parts where the skeleton crosses two areas with different colours can then be used as reference points. Another is to take two areas for which the reference point is needed and apply the skeleton algorithm to only that part.

An example would be the following: say you want to find the location of a knee. The segmentation labelled one area as the upper leg and one as the lower leg. Using a mask that covers both those areas, a distance transform would show a line along the length of the leg. If the centre of the mask is used as the starting point and the algorithm is given two starting vectors in opposite directions, roughly along the line of the leg, a mini skeleton could be generated. After the two sets of points are combined, the point most representative of the knee could be selected. The major advantage of this method is that any pose can be analysed. The segmentation algorithm is based on training and recognition of features, which means that the arms could touch the torso and still be identified. If the mask is then applied to only an arm, the elbow can be found, ignoring the torso, while this would not be possible if the entire skeleton were first generated from an I-pose.
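The thesis does not specify how "the point most representative of the knee" would be selected. One hypothetical heuristic, assuming the combined point sets form a single ordered path, is to pick the point where the path bends most sharply:

```python
import math

def bend_angle(a, b, c):
    """Angle (degrees) at point b formed by segments b->a and b->c;
    180 means a straight line, small values mean a sharp bend."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def knee_point(path, window=2):
    """Pick the path point with the sharpest bend, as a stand-in for
    the joint location. `window` controls how far apart the three
    sampled points are; its value here is an assumption."""
    best, best_angle = None, 181.0
    for i in range(window, len(path) - window):
        angle = bend_angle(path[i - window], path[i], path[i + window])
        if angle < best_angle:
            best, best_angle = path[i], angle
    return best
```

On a path that runs straight down and then turns, the corner point wins, which is the behaviour one would want at a bent knee or elbow.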


Chapter 8

Conclusion

In section 1.4, the expected progression and intended objectives for this thesis were defined. This chapter briefly compares these to what was actually achieved and assesses the quality of the results.

8.1 Progression and steps followed

These were the steps that needed to be completed to achieve the goal of this thesis.

Capture and store the needed data.  Single point clouds were captured from a Kinect using ROS topics and were then converted to RGB images. This meant that only the point clouds needed to be saved as .pcd files.

Separate the person from the background in the 2D image.  An algorithm was designed based on thresholding the point cloud using depth values, detecting and removing the floor using the point cloud, and then using connected component segmentation on the RGB image to single out the group of pixels that represents the person. For this to work, the person needs to stand roughly in the centre of the image and be in contact with the floor, which is the largest horizontal plane in the scene. The result was as expected, with a processing time of 200 to 800 ms.
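The final connected-component step can be sketched as a flood fill from the centre pixel. This is a minimal pure-Python illustration, not the thesis implementation (which operates on the RGB image), and `keep_centre_component` is a hypothetical name.

```python
from collections import deque

def keep_centre_component(fg, seed):
    """Flood fill from `seed` (assumed to lie on the person, e.g. the
    image centre) and keep only that 4-connected group of foreground
    pixels, discarding leftover background blobs. `fg` is a 2D list
    of 0/1 values after depth thresholding and floor removal."""
    h, w = len(fg), len(fg[0])
    keep = [[0] * w for _ in range(h)]
    if not fg[seed[0]][seed[1]]:
        return keep  # assumption violated: centre pixel is not foreground
    queue = deque([seed])
    keep[seed[0]][seed[1]] = 1
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and fg[ny][nx] and not keep[ny][nx]:
                keep[ny][nx] = 1
                queue.append((ny, nx))
    return keep
```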

Develop a method to draw a pseudo-skeleton on the 2D image of the person.  An algorithm was developed that successfully traces the pseudo-skeleton visible in the distance transform of the foreground mask. As long as the legs and arms are differentiated from each other and from the torso, all body parts can successfully be traced, except when the arms are bent too much. A solution to this problem is proposed in section 7.2. The result is a set of coordinates grouped into 6 body parts: head, right arm, left arm, torso, right leg, left leg.
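The core tracing idea can be sketched as a greedy walk over the distance transform. This simplified illustration omits the starting vectors, end-detection and stuck handling described elsewhere in the thesis; it just steps to the unvisited neighbour with the largest distance value until only background remains.

```python
def trace_ridge(dt, start, visited=None, max_steps=10000):
    """Greedy sketch of ridge following: from `start`, repeatedly
    step to the unvisited 8-neighbour with the largest distance-
    transform value, stopping when only background (value 0) is
    left, i.e. at the edge of the mask. `dt` is a 2D list."""
    visited = visited if visited is not None else set()
    y, x = start
    visited.add(start)
    path = [start]
    for _ in range(max_steps):
        best, best_val = None, 0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (dy or dx) and 0 <= ny < len(dt) and 0 <= nx < len(dt[0]):
                    if (ny, nx) not in visited and dt[ny][nx] > best_val:
                        best, best_val = (ny, nx), dt[ny][nx]
        if best is None:
            break  # no unvisited foreground neighbour: mask edge reached
        y, x = best
        visited.add(best)
        path.append(best)
    return path
```

Sharing the `visited` set between the runs for different body parts keeps the tracer from re-walking the torso when it starts on an arm.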

Methods were also created for detecting the transition from the torso to the legs and the end of a body part. After some tweaks the former worked with a 100% success rate, but the end-detection did not always perform as intended. Due to the irregular shape of the foreground mask and varying body shapes, the algorithm could be tweaked to work in one situation but not universally. It caused the skeleton to either stop prematurely or miss the end of the body part altogether. When allowed to run its course, the 2D skeleton algorithm will continue following its path to the edge of the mask.


Adapt the method used for the 2D skeleton for use with the 3D point cloud.  The 3D skeleton was created using the depth values at the local edge of the body part in question. After this, the points were smoothed to produce a nicer result. As long as the 2D skeleton is a correct representation of the person, the 3D skeleton will produce the expected results. As with its 2D counterpart, the result is a set of 3D coordinates grouped into the 6 body parts.
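The smoothing pass is not specified in detail; a plausible sketch is a moving average along each body part's ordered chain of 3D points, with the window size and the function name as assumptions.

```python
import numpy as np

def smooth_chain(points_3d, window=3):
    """Moving-average smoothing of an ordered chain of 3D skeleton
    points: each point becomes the mean of its neighbours within
    `window`, with a shrinking window at the ends of the chain so
    the joints at the extremities are not pulled inwards too far."""
    pts = np.asarray(points_3d, dtype=float)
    half = window // 2
    out = np.empty_like(pts)
    for i in range(len(pts)):
        lo, hi = max(0, i - half), min(len(pts), i + half + 1)
        out[i] = pts[lo:hi].mean(axis=0)
    return out
```

This damps the depth jitter at the body edges mentioned above while leaving the 2D path of the skeleton essentially unchanged.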

Test both algorithms with a variety of subjects and poses and (if time allows) implement them in an application.  The algorithms were tested using 21 point clouds of 4 men and 4 women in 3 different poses. As mentioned above, the results were very good except when the arms were bent too much. Detecting the end of the body parts did not work well, but this is ultimately unnecessary when the algorithms for the skeletons are combined with the research by Koen Buys et al. [2].

There was insufficient time to implement the results in an application, but the main goals of the assignment were achieved.

8.2 Objectives planned and achieved

These core questions needed to be answered to obtain an optimal understanding of what is possible with the given environment and equipment.

What is the most efficient way to segment an image in order to separate a person from his/her background?  Various methods were experimented with. The final method chosen is based on depth thresholding and connected component segmentation. See section 4.1.5 for a more detailed description.

What degree of accuracy is needed when removing the background to facilitate the generation of the skeleton?  The point cloud produced by a Kinect assigns a depth equal to the person's to some of the background pixels. This makes the body parts wider, but as long as the same number of pixels is added on each side, the location of the skeleton will remain the same. These false foreground pixels gain significance when they are located between two body parts and could make it seem as if they are combined into one entity. It is very important that each body part is discernible. Grabcut can be used to partially reduce this effect.

How is the skeleton best constructed?  The method used in this thesis is to start on the torso at arm height and to proceed up, down, left and right from there. This point should also have the highest value in the distance transform (if it does not, this is automatically corrected in the algorithm). At the bottom of the torso, the legs must then be detected. This leaves a total of 6 separate body parts: head, right arm, left arm, torso, right leg, left leg.

Is it possible to create a true 3D skeleton with the accuracy provided by the Kinect?  The point clouds provided by the Kinect proved to be too inaccurate to create a truly reliable 3D skeleton. Nevertheless, the 3D skeleton created by the algorithm in this thesis is a representation of the person that should suffice for many applications. There are two main limitations that cause problems:

- The cloud only represents the scene from one perspective. The IR light travels in straight lines and stops somewhere along the person's sides. The problem is that it remains unknown whether this depth at the edge of the person represents the depth at the person's centre, or slightly before or after it. Body parts such as the ears also cause problems, as the maximum depth immediately below them will be greater.


- If, for example, an arm is located in front of the torso, the part of the torso that lies behind the arm does not exist in the point cloud. This means it has to be interpolated before a distance transform can be performed.

Section 7.1 explores the possibilities when a truly reliable 3D model is available.

What assumptions must be made to keep the various aspects manageable?  Sections 4.1.5 and 5.1.1 specify the assumptions made for the algorithms in this thesis to work. The most significant of these is that for the 2D skeleton, the arms and legs must not touch each other or the torso.

8.3 Closing thoughts

The goals set in this thesis were achieved and the results are mostly as desired. Many aspects could still be improved upon and, as the code was written in a modular fashion, it is relatively simple for someone else to take some of my code and expand on it in future projects.


Appendix


Test Results

File name    Reached the expected end                Reached the expected end
             (without end detection)? y(es) n(o)     (with end detection)? y(es) n(o)
             Head  R arm  L arm  R leg  L leg        Head  R arm  L arm  R leg  L leg
boy1I         y     n      y      y      y            n     /      n      n      n
boy1T1        y     y      y      y      y            n     n      n      y      n
boy1T2        y     y      y      y      y            y     y      y      n      n
boy1W         y     n      n      y      y            n     /      /      n      n
boy1Y         y     n      n      y      y            n     /      /      n      y
boy2T         y     y      y      y      y            n     n      n      n      n
boy2Y         y     y      n      y      y            n     n      /      n      y
boy3T         y     y      y      y      y            n     n      n      n      n
boy3Y         y     n      n      y      y            n     /      /      n      n
boy4I         y     y      n      y      y            n     n      /      n      y
boy4Y         y     n      n      y      y            n     /      /      n      n
girl1I        y     n      y      y      y            n     /      y      n      n
girl1T        y     y      y      y      y            n     n      n      y      y
girl1Y        y     n      y      y      y            n     /      n      n      y
girl2I        y     n      n      y      y            n     /      /      n      n
girl2T        y     y      y      y      y            n     n      n      n      y
girl2Y        y     y      n      y      y            y     n      /      n      n
girl3Y        y     n      y      y      y            n     /      n      n      y
girl4I        y     n      n      y      y            y     /      /      y      y
girl4T        y     y      y      y      y            n     n      n      n      n
girl4Y        y     n      y      y      y            n     /      n      n      y

Average
(ignore I-pose): 100%  56%    63%   100%   100%      19%    6%     6%    13%    44%

Naming convention: shape of the body represented by a letter.
I = Arms at the side (usually hard to distinguish between arms and torso)
T = Arms straight outwards horizontally
W = Arms bent and pointed up (upper arms not horizontal)
Y = Arms bent and pointed up (upper arms horizontal)

With detection, the end should occur before the distance transform forks again. Without detection the end is reached if the line does not suddenly start following a different part. If the end is not reached without detection, the result with detection becomes irrelevant and is left empty ("/").


Test Results: time taken (ms)

File name    2D skeleton    3D skeleton    Nr of stuck warnings
boy1I         418            30             3
boy1T1        406            28             0
boy1T2        463            30             5
boy1W         337            31             0
boy1Y         394            30             5
boy2T         501            30             10
boy2Y         745            29             5
boy3T         421            29             1
boy3Y         381            29             3
boy4I         346            28             0
boy4Y         344            29             0
girl1I        362            28             1
girl1T        352            29             0
girl1Y        325            29             0
girl2I        349            30             2
girl2T        1400           28             12
girl2Y        382            37             1
girl3Y        400            29             3
girl4I        252            29             0
girl4T        339            29             0
girl4Y        317            30             0

Average
(ignore I-pose): 366          30             1.3


Bibliography

[1] Mikkel Viager. Analysis of Kinect for Mobile Robots, individual course report for University of Denmark, March 2011.

[2] Koen Buys, Cedric Cagniart, Anatoly Baksheev, Caroline Pantofaru. An adaptable system for human body tracking. To be published in 2012.

[3] PrimeSense. PrimeSensor Reference Design 1.08, in primesense.360.co.il/files/FMF_2.PDF on 28/11/2011.

[4] www.engadget.com on 06/05/2012.

[5] C. Rother, V. Kolmogorov, A. Blake. "GrabCut" - Interactive Foreground Extraction using Iterated Graph Cuts, SIGGRAPH '04, 2004.

[6] R. Diestel. Graph Theory, 4th Electronic Edition, 2010.

[7] T. Collins. Graph Cut Matching in Computer Vision, University of Edinburgh, February 2004.

[8] D. Grest, V. Kruger and R. Koch. Single View Motion Tracking by Depth and Silhouette Information, in Scandinavian Conference on Image Analysis (SCIA '07), 2007.

[9] D. Grest, J. Woetzel and R. Koch. Nonlinear body pose estimation from depth images. In Proc. DAGM, 2005.

[10] Y. Zhu and K. Fujimura. Constrained optimization for human pose estimation from depth sequences. In Proc. ACCV, 2007.

[11] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. In Proc. CVPR, 2011.

[12] G. Borgefors. Distance Transformations in Digital Images. In Computer Vision, Graphics and Image Processing, Vol. 34, 344-371, 1986.

[13] A.A. Eftekharian and H.T. Ilies. Distance Functions and Skeletal Representations of Rigid and Non-rigid Planar Shapes. In Computer-Aided Design, 41(12):865-876, 2009.

[14] http://pointclouds.org on 02/12/2011.

[15] http://opencv.willowgarage.com on 08/10/2011.

[16] http://www.willowgarage.com on 05/11/2011.

[17] http://www.ros.org on 02/12/2011.

[18] http://www.ros.org/wiki/openni_camera on 05/11/2011.

[19] http://www.openni.org/ on 22/10/2011.

[20] http://www.primesense.com/en/technology on 21/11/2011.

[21] http://en.wikipedia.org/wiki/3D_scanner on 25/11/2011.

[22] http://en.wikipedia.org/wiki/Time-of-flight_camera on 25/11/2011.

[23] http://inperc.com/wiki/index.php?title=Stereo_vision on 25/11/2011.

[24] Niels Van Malderen. Human Motion Capturing: Visual Hull Reconstruction, Master's thesis 2009-2010 for Lessius Campus De Nayer.

[25] http://www.makehuman.org on 27/12/2011.

[26] http://www./help/toolbox/images/f8-20792.html on 04/01/2012.

[27] http://www.primesense.com/en/component/content/article/5-news/25-gestural-interfaces-controlling-computers-with-our-bodies.

[28] http://www.mesa-imaging.ch/prodview4k.php on 24/05/2012.

[29] http://www.boost.org/doc/libs/1_49_0/libs/bind/bind.html on 24/05/2012.

[30] http://java.sun.com/blueprints/patterns/MVC-detailed.html on 24/05/2012.

[31] M.A. Fischler and R.C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. In Communications of the ACM, 24(6):381-395, 1981.