
    VISION-BASED SKIN-COLOUR SEGMENTATION OF MOVING HANDS

    FOR REAL-TIME APPLICATIONS

    S. Askar, Y. Kondratyuk, K. Elazouzi, P. Kauff, O. Schreer

    Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut

    Germany

    ABSTRACT

    We present a robust vision-based skin-colour

    segmentation method for moving hands in a real-time

    application. Segmentation of hands is an important

    processing step in gesture recognition applications,

    where the general shape and position of the hands are of

    interest. In contrast to these approaches, the presented

method concentrates on an accurate segmentation, which is required for further processing steps in a real-

    time videoconferencing application. A hand tracking

    procedure is applied to improve the segmentation in

    terms of accuracy, robustness and processing speed.

Furthermore, the presented approach can cope with

    difficult situations like contact between hands or contact

    between face and hands. This is important for many

    real-time applications, e.g. for the presented

    videoconference system to allow the conferees a natural

    behaviour. Moreover we present an approach for an

    automatic initialisation of the skin-colour range to the

    specific user. We show experimental results proving the

efficiency and reliability of our approach. The proposed hand segmentation method is capable of processing TV-

    sized (CCIR 601, 576x720 pixels) video images in real-

    time with 25 Hz on a common PC. The presented

    approach will support any video processing in visual

    media production, where segmentation accuracy and

    real-time capability is required.

    1 INTRODUCTION

    Numerous applications use skin-colour as one of the

basic features for detecting or analysing the human face or hands. They have different aims and different

    constraints under which the human face or hands are

    being analysed. One crucial point, which is common for

    most of the applications in this context, is an accurate

segmentation of the human face or hands. Many applications deal with the segmentation of hands, such as

    hand sign recognition, human vehicle interaction,

human-computer interfaces; but common to all of them is a

rough segmentation result, as other features are derived from it

(refer to Cui (1), Imagawa (2), Guo (3), Zhu (4), Starner (5)). In some hand segmentation approaches marked

    gloves are used, which are not applicable in video

conferencing systems (see Dorfmueller (6)). In other approaches infrared cameras are used or depth

information based on multiple views is exploited, e.g. Sato

    (7), Malassiotis (8), Jennings (9). The real-time

constraint is considered as well, but only in gesture

recognition applications, e.g. in Lovell (10), Herpers

    (11). The combination of accurate segmentation of both

    hands, robust tracking including overlap of hands and

    head, and real-time capability on high resolution video

    has not been considered in any publication before.

    The presented method on segmentation of hands using

    skin-colour is a resulting work of a project on

immersive 3D video conferencing, which is being developed at FhG/HHI (see Kauff (12)). In this system,

    segmentation of hands is an important part to improve

    different succeeding processing steps such as disparity

    estimation or synthesis of virtual views.

    The approach is a new robust segmentation method for

    moving hands, which can also handle contact between

    hands and contact between hands and head. The

    fundamental algorithm is based on skin-colour

segmentation and uses so-called bounding boxes, which

track the hands and head separately (Fig. 1). A spatial sub-

    sampling in the considered bounding boxes guarantees a

    more robust and additionally a fast segmentation.

Furthermore, we present a new initialisation algorithm, which determines the specific skin-colour values

    automatically based on the first few images. This is very

    important in order to limit the skin colour range to the

    specific person and to achieve robustness and accuracy.

    Our algorithms process TV-sized video images (CCIR

    601, (576x720) pixels) in real-time with 25 Hz.

Fig. 1: Hand contours and bounding boxes

    In contrast to many approaches in the field of e.g.

gesture recognition or controlling movements, we are not interested in features of the hands like the orientation of

the hands or their motion. The key problem in our

    application is to find a closed region, which coincides as

    much as possible with the real contour of the hand to

    specify depth discontinuities. In the proposed method,

an initialisation of the specific skin colour range is followed by the real-time segmentation, which consists

    of two succeeding steps: the tracking of hands and the


    skin-colour based segmentation (see Fig. 2). The

    tracking of hands is performed on a sub-sampled QCIF

    image, whereas in the second step a region growing

approach is applied to the full resolution image to extract the final hand segments.

[Fig. 2 block diagram: initialisation (determination of the skin-colour range), hand tracking on the sub-sampled image, skin-colour segmentation on the original image size]

Fig. 2: Block diagram of the presented method

    In the next section, the concept of immersive 3D

    videoconferencing is described and the relevance of

    accurate segmentation of hands in this context is

explained. Then, the automatic initialisation of the specific skin-colour range of the participant is

presented. In sections 4 and 5, the tracking of hands and

    the skin colour segmentation are proposed. In section 6,

a solution for the case of overlapping skin-coloured

    regions is presented. Experimental results are shown in

    section 7. The paper ends with a conclusion.

    2 IMMERSIVE 3D-VIDEOCONFERENCING

The basic idea of immersive 3D videoconferencing is that the participants perceive the virtual

conference scene under the correct perspective. This

    includes full eye contact with the remote participants,

    although the cameras are mounted around the display. It

is achieved by a synthesis of virtual views in order to simulate a virtual camera at the correct position on the

    display (see Lei (13)). The current demonstrator of the

    immersive 3D videoconferencing system is shown in

    Fig. 3.

To achieve the correct perspective view of the remote participants, a real-time capable disparity estimator has been developed, calculating depth information from

    stereo camera images (see Schreer (14)). Although this

    disparity estimator provides convincing results, it fails

    at depth discontinuities in occluded areas, where pixel

correspondences cannot be calculated. This leads to

artefacts in the synthesized views. Due to the nature of the videoconferencing application, depth discontinuities

    mainly occur at the contours of the free gesticulating

hands. Since artefacts in these areas severely

disturb the impression of immersiveness and

    natural representation of the remote participants, a

    reduction of these effects is desired.

Fig. 3: Demonstrator of the immersive 3D

    videoconferencing system

Accurate segmentation masks of both hands provide

very helpful information for improving the disparity estimation by replacing wrong or unknown

    disparities with reliable values (see Schreer (14)). The

colour of human hands is a striking feature that offers a

    solution to this problem. Segmentation of skin-colour

can provide information about the depth discontinuity at the contour area of the hands.

    3 INITIALISATION OF SKIN-COLOUR

Usually, TV and video data are available in the YUV

colour space. Hence, the investigations in this approach have been made in the YUV colour space. Other colour

    spaces (e.g. HSV, HSI) aim to provide a more uniform

and accurate representation of colour, interpreting it in

the same way as the human perceptual system does. But the transformation from video signals to these colour

spaces is a very time-consuming process and needs to be

avoided in real-time applications. In our proposed

algorithm we consider only the chrominance channels

(U, V), which fully represent the colour. The

skin-colour is described as a quadruple consisting of

the mean values m_u, m_v and the tolerance values t_u, t_v.
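
To make this concrete, a minimal Python/NumPy sketch of the classification rule follows; the box-shaped decision region comes directly from the quadruple above, while the function name and the array interface are our own assumptions.

```python
import numpy as np

def skin_mask(u: np.ndarray, v: np.ndarray,
              m_u: float, m_v: float,
              t_u: float, t_v: float) -> np.ndarray:
    # A pixel is skin-coloured if both chrominance values lie
    # within the tolerances around the mean values.
    return (np.abs(u - m_u) <= t_u) & (np.abs(v - m_v) <= t_v)
```

Applied to the U and V planes of a YUV image, this yields the binary skin mask used in the following steps.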

The spectral reflectance of human skin is independent

of the human race and of the wavelength of the exposed light (see Anderson (15)). The same

    observation can be made considering the transformed

    colour in common video formats. Hence, the human

    skin-colour can be defined as a global skin-colour

cloud in the colour space (see Störring (16)). The

    general thresholds for this striking area are still too large

    to obtain reasonable segmentation results. Depending on

    different factors like shadows, illumination, colour

    distribution in the particular video data, different

pigmentation of the person's skin and so on, it is useful

    to adapt thresholds to the given illumination conditions

    and the observed person. Therefore, it is assumed that

the skin-colour of a human under certain conditions can be considered as a subset of a global skin-colour

    cloud. Hence we distinguish between the following

  • 8/10/2019 vision based skin color segmentation

    3/7

    two terms: 1) global skin-colour, representing skin-

    colour in a general way with large tolerance values, and

    2) skin-colour, representing the skin-colour for the

specific person under certain illumination conditions, described with specific mean values and reduced

tolerances.

Hence, an important question arises: how to determine

    appropriate skin-colour parameters for a scenario to

achieve the best segmentation results. Applying parameters

from a general statistical analysis of skin-colour does not lead to optimal segmentation results in the majority of

cases. But they can often be used as good coarse start

    values to find appropriate parameters by slightly

    varying them.

One option is to adapt the thresholds manually at the beginning of the segmentation. This is obviously not

convenient in terms of usability and user friendliness

in the case of a video conferencing system. Therefore, a

    quasi-automatic method is presented to find suitable

parameters. Nevertheless, in the case of extremely dark or extremely bright illumination an additional manual adjustment is unavoidable. However, we experienced

that for brighter illumination it is reasonable to choose

    larger tolerance values than for dark cases.

    The initialisation step is performed in the sub-sampled

image for real-time and stability purposes. Besides the

desired skin-colour range, it also provides the three centres of gravity of the two hands and the head, which are used

    as start positions for the bounding boxes. In the first

    image, a pixelwise skin-colour segmentation is

    performed. For the initial skin colour range, threshold

    values are obtained from statistical analysis of a number

of images representing the global skin-colour cloud. After applying the global thresholds, a rough binary

    mask is obtained, which is filtered to reduce noise.

    The goal of the following process is to determine the

    blob position of both hands and the head and to

    calculate new and more accurate skin-colour threshold

    values in the distinct area. Hence, the row and column

histograms of the binary image are calculated, which

represent the skin-coloured pixel distribution in

    horizontal and vertical direction (Fig. 4).

Fig. 4: Row and column histograms of the binary image

The image is now divided into three equal stripes in horizontal and vertical direction, which leads to nine

    equal areas in the whole image. For each stripe the

    maximum in the corresponding histogram interval is

    determined (Fig. 5). The points of intersection of the

    horizontal and vertical maxima yield nine potential

positions of the centres of gravity of possible hand or head blobs (Fig. 6). Obviously, some of them have to be

    wrong.

    A neighbourhood analysis searching for the points with

    the most skin-coloured neighbour pixels removes wrong

    points. The resulting three positions mark the three skin-

    colour blobs: left hand, right hand and face (Fig. 7).

Fig. 5: Determination of the maximum in each stripe

Fig. 6: Points of intersection based on horizontal and

    vertical maxima

Fig. 7: Resulting blob positions
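
A sketch of this blob localisation in Python is given below; the stripe splitting and the intersection points follow the description above, whereas the neighbourhood window size and the exact scoring of the intersection points are assumptions, since the paper does not give these details.

```python
import numpy as np

def locate_blobs(mask: np.ndarray, win: int = 5):
    """Find the three centres of gravity (left hand, right hand, face)
    in a binary skin mask using row and column histograms."""
    rows, cols = mask.shape
    row_hist = mask.sum(axis=1)   # skin pixels per image row
    col_hist = mask.sum(axis=0)   # skin pixels per image column

    # Maximum of the histogram within each of three equal stripes.
    def stripe_peaks(hist, n):
        third = n // 3
        return [s * third + int(np.argmax(hist[s * third:(s + 1) * third]))
                for s in range(3)]

    row_peaks = stripe_peaks(row_hist, rows)
    col_peaks = stripe_peaks(col_hist, cols)

    # Nine intersection points; keep the three with the most
    # skin-coloured pixels in their neighbourhood.
    def support(r, c):
        return int(mask[max(r - win, 0):r + win + 1,
                        max(c - win, 0):c + win + 1].sum())

    candidates = [(r, c) for r in row_peaks for c in col_peaks]
    return sorted(candidates, key=lambda p: support(*p), reverse=True)[:3]
```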

The proposed histogram method is independent of the

orientation of the camera. Obviously, this approach works reliably if the hands and head are in different

image regions; they do not, however, need to be at specific positions.

  • 8/10/2019 vision based skin color segmentation

    4/7

    In order to distinguish between the face and the hands

the assumption was made that the object on the top is

    related to the face. In the case of larger rotations of the

camera, this information must be taken into account to assign the blob position of the face correctly (e.g.

leftmost object, rightmost object, ...). In summary, a

    few rules have to be considered by the observed person,

    but very simple and general ones. The experiments have

proven that the correct blob positions are computed

reliably after a few frames. In our online real-time application, the initialisation will

be repeated until three separate and reliable blob

    positions are determined. After successful computation

of the blob positions, the skin-coloured pixels around

them are analysed, and the specific new mean values and tolerances are derived. After delivery of the

    initialisation parameters, the vision-based hand

    segmentation starts. During the hand segmentation

    process the bounding boxes are kept under surveillance.

Some hints, e.g. no segmented pixels in a box or boxes leaving the image area, lead to re-initialisation. In these cases we assume either a failure of the tracking

process or a change of the environmental conditions

    (shadows, illumination change).
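
The paper states that new mean values and tolerances are derived from the skin-coloured pixels around the blob positions, but not how; the sketch below assumes a simple choice, namely the per-channel mean and a multiple of the standard deviation.

```python
import numpy as np

def estimate_skin_params(u, v, blob_mask, k=2.5):
    # u, v: chrominance planes; blob_mask: boolean mask selecting the
    # skin-coloured pixels around the detected blobs.
    # The factor k scaling the standard deviation is an assumption;
    # the paper only states that reduced, person-specific tolerances
    # replace the large global ones.
    us, vs = u[blob_mask], v[blob_mask]
    m_u, m_v = float(us.mean()), float(vs.mean())
    t_u, t_v = k * float(us.std()), k * float(vs.std())
    return m_u, m_v, t_u, t_v
```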

    4 TRACKING

An enormous optimisation in speed can be achieved if

    the segmentation of the hands is limited to a specific

    region. This is very important for real-time applications.

Therefore we define two bounding boxes, which track the hands of the conferee continuously during the

    conference session and the search is performed just

    inside these boxes. After a pixelwise skin-colour

    segmentation inside each bounding box, we calculate

    the centre of gravity of the obtained skin-colour area in

    the whole box. The purpose of the tracking phase is to

    determine the centres of gravity of the hands using the

    previous bounding box area. Then a new bounding box

    position is calculated and delivered to the succeeding

segmentation step. In Fig. 8, left, the calculation of the

new centre of gravity inside the old box is depicted. The

box shifted to the new position is shown in Fig. 8, right.


Fig. 8: Tracking of the centre of gravity
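
One tracking iteration can be sketched as follows; the (top, left, height, width) box representation and the None signal for an empty box are hypothetical details, not taken from the paper.

```python
import numpy as np

def track_box(mask, box):
    # mask: pixelwise skin segmentation of the sub-sampled image.
    # box:  (top, left, height, width) of the old bounding box.
    top, left, h, w = box
    window = mask[top:top + h, left:left + w]
    ys, xs = np.nonzero(window)
    if ys.size == 0:
        return None  # no skin pixels in the box: a hint for re-initialisation
    # Centre of gravity of the skin-coloured area inside the old box ...
    cy, cx = top + int(ys.mean()), left + int(xs.mean())
    # ... becomes the centre of the new bounding box.
    return (cy - h // 2, cx - w // 2, h, w)
```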

Moreover, we perform the whole tracking process in the

    sub-sampled image to achieve further reduction of

    processing time. As in the tracking step only a blob-

tracking is performed, it is sufficient to apply this procedure in the sub-sampled image. The usage of

    bounding boxes and the tracking in the sub-sampled

image have much more impact besides real-time

capability, as the segmentation becomes extremely

robust. Pixels spatially far away from hands and head,

but coloured similarly to skin, do not have any influence on the segmentation result. In addition, the

tracking process in the sub-sampled image acts as a filter as

well and thus leads to a reduction of noise in

    the blob positions.

    5 SKIN-COLOUR SEGMENTATION

For the succeeding accurate segmentation of the hands in the high-resolution image, a skin-colour segmentation method based on a region growing technique has been

developed. The region growing approach requires a so-

    called seed point for the segmentation, which is

    provided by the previous tracking process. Starting from

    this point the segmented area is enlarged by analysing

continuously the neighbours of the segmented pixels. The advantage of this technique is that it leads to one

    closed region.
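
A minimal sketch of such a region growing step is shown below; the 4-connectivity and the breadth-first traversal are our assumptions, as the paper only requires that one closed region is grown from the seed point.

```python
from collections import deque
import numpy as np

def grow_region(is_skin, seed, shape):
    # is_skin(y, x): skin-colour test for one full-resolution pixel;
    # assumed to return False outside the image.
    # seed: starting point provided by the tracking step.
    region = np.zeros(shape, dtype=bool)
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        if not (0 <= y < shape[0] and 0 <= x < shape[1]):
            continue
        if region[y, x] or not is_skin(y, x):
            continue
        region[y, x] = True
        # Enlarge the segmented area by the 4-neighbours of this pixel.
        queue.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return region
```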

    The region growing approach accounts for the case that

    the gravity point obtained from the tracking step does

    not have skin-colour, e.g. because it lies between two

fingers or because hand boxes overlap and mislead tracking. Additionally, the situation is taken into account in which

contact between hands and face occurs. Both cases

    are discussed in the following section.

Fig. 9: Region growing

    6 CONTACT OF SKIN-COLOURED AREAS

In order to allow the participants of the conference session a natural behaviour with free gestures, contact

    between both hands and also contact between hands and

face has to be considered. This is done twice: in the tracking step as well as in the segmentation step. If the

  • 8/10/2019 vision based skin color segmentation

    5/7

    hands are very close to each other the following

    problem occurs in the tracking phase. For example, if

    the left hand box also detects a part of the right hand in

the box (see Fig. 10, left), then tracking could be misled. Due to the segmented parts of the other hand,

    the centre of gravity could be shifted to a wrong

    position with no skin-colour.

    In the worst case, the search for a skin-coloured region

    in the neighbourhood of the determined non-skin-

coloured centre of gravity will lead to the wrong right hand, and the left hand might be lost. In order to avoid

    this case, the search directions, starting from the

determined centre of gravity, are limited to the side

opposite the position of the right hand box, and vice versa (Fig.

10, right). A fast and sophisticated retrieval strategy preserves the correct hand.


Fig. 10: Contact of hand boxes
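
The retrieval strategy itself is not detailed in the paper; the sketch below merely illustrates the idea of limiting the search directions to the side facing away from the other hand's box. The step pattern and the search length are assumptions.

```python
def find_seed_away_from(is_skin, start, other_centre, max_steps=20):
    # start: (possibly non-skin) centre of gravity of this hand's box.
    # other_centre: centre of the other hand's box; search only away from it.
    # is_skin(y, x) is assumed to return False outside the image.
    y, x = start
    dy = -1 if other_centre[0] > y else 1
    dx = -1 if other_centre[1] > x else 1
    for step in range(max_steps):
        for cand in ((y + step * dy, x),
                     (y, x + step * dx),
                     (y + step * dy, x + step * dx)):
            if is_skin(*cand):
                return cand
    return None  # no skin found: hand considered lost, re-initialise
```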

If both hands are in contact with each other, then the

bounding boxes overlap. If the hands come apart, the

bounding boxes obviously must be separated, which is again not trivial. To overcome this problem, the following approach has been implemented to separate

    the bounding boxes, when the hands get separated. For

each bounding box, favourite directions are defined, e.g.

the left-bottom edge for one box and the right-top edge for the

    other. While the hands are in contact, the boxes are just

allowed to move in the preferred directions. If the hands are not connected any more, the preference gets

    switched off and the movement of the bounding boxes

    is not limited further on. Example images of a sequence

    are presented in the next section.

    If a hand has contact with the face, the following is

performed: in addition to the hands, the head blob of the participant resulting from the initialisation phase is

    tracked as well in the sub-sampled image, using a third

    bounding box. If one of the hand boxes overlaps with

    the head box, then only the non-overlapping part of the

    hand box is considered for tracking the centre of

gravity. This is shown in Fig. 11, left. Thus, a wrong movement of the hand box towards the

    head is avoided and tracking becomes much more

robust without losing the hand. Otherwise, if a hand

    disappears completely inside the head box, an area

    surrounding the head box is observed waiting for the

hand to come out of the head box (see Fig. 11, right). If skin-coloured pixels are detected in the surrounding area,

the tracking of the hand is continued.


Fig. 11: Contact of hand and head box
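
The exclusion of the overlapping part when tracking the hand's centre of gravity can be sketched as follows, reusing the hypothetical (top, left, height, width) box representation of the earlier sketches.

```python
import numpy as np

def centroid_excluding_head(mask, hand_box, head_box):
    t, l, h, w = hand_box
    window = mask[t:t + h, l:l + w].copy()
    ht, hl, hh, hw = head_box
    # Zero out the part of the hand box that overlaps the head box,
    # so the centre of gravity cannot drift towards the face.
    it, ib = max(t, ht), min(t + h, ht + hh)
    il, ir = max(l, hl), min(l + w, hl + hw)
    if it < ib and il < ir:
        window[it - t:ib - t, il - l:ir - l] = False
    ys, xs = np.nonzero(window)
    if ys.size == 0:
        return None  # hand fully inside the head box: watch the surrounding area
    return t + int(ys.mean()), l + int(xs.mean())
```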

    7 EXPERIMENTAL RESULTS

The presented methods run on a standard PC

(Pentium IV, 2 GHz) in real-time on full TV-resolution

video (576x720 pixels at 25 Hz). Hence, all situations,

    such as different behaviour and gestures, have been

tested under real conditions. The following extracted images of a sequence will show the robustness in

    several situations and the accuracy of the segmentation.

In Fig. 12, an example is given where the hands contact

each other and come apart. After the contact, tracking is

    still successful and the bounding boxes can be separated

    correctly.

In Fig. 13, a misleading tracking is shown. In this case,

the right hand box gets lost after contact of the

hand with the face. Instead, the face of the person is

    wrongly tracked. The successful operation of our

method is shown in Fig. 14 and Fig. 15 for situations where a single hand, and also both hands, have contact

    with the face region.

Fig. 12: Contact of hands

  • 8/10/2019 vision based skin color segmentation

    6/7

Fig. 13: Contact of hand and head box, hand box is lost

Fig. 14: Contact between hand and head, correct tracking

Fig. 15: Contact of both hands and head together, correct tracking (order: left to right)

The image series (Fig. 14) shows that the right hand

box still tracks the hand correctly after the contact, using the head box processing method. Despite the robust

tracking, it must be noted that with our algorithm it is

    not possible to determine the contours of the objects

    while they are connected. Only if they are separated, our

    application makes use of the contours determined in the

    single boxes.

Finally, Fig. 15 gives an example where both hands touch the head at the same time. After separation, each

box correctly tracks the corresponding object.

Actually, some assumptions for our skin-colour

segmentation method have been made:

- no sudden change of illumination,

- long sleeves on the clothes worn,

- normal motion speed of the hands while gesticulating.

Minor changes in the illumination can be handled

easily, as in every new image the actual skin-coloured pixels

are determined. Based on these pixels, new thresholds can be derived.

    The restriction to long sleeves is mainly determined by

the size of the bounding boxes. A larger bounding box increases the computational effort, which may result

    in a lower frame rate under certain circumstances.

    Nevertheless, experiments have successfully shown a

segmentation of participants wearing T-shirts. The speed of moving hands may cause a misleading

tracking. But in online tests, it turned out that hands have

    to be moved quite fast, which is not expected as a

    normal gesticulating behaviour.

    8 CONCLUSION

    In this paper a new robust method for accurate

    segmentation of hands has been presented running

successfully in a real-time application, processing TV-sized images (576x720 pixels, 25 Hz). A new method was

    proposed to adjust the thresholds for skin-colour

    segmentation automatically according to the specific

    participant and the illumination conditions. The required

region of interest of skin-coloured pixels is determined quite robustly using a new histogram technique. The

    segmentation is performed in bounding boxes

surrounding the hands in order to reduce the computational effort. These boxes are tracked

    continuously, whereas the method is able to handle

  • 8/10/2019 vision based skin color segmentation

    7/7

    contact between both hands and contact between hands

and face without losing the tracked objects, namely

    the hands. A continued analysis strategy controls

tracking and segmentation and provides accurate hand masks. Besides videoconferencing, other applications are

    kept in mind for successful use of the presented

    methods, such as advanced gesture recognition tools,

post-production using 3D or photo-realistic rendering.

    The presented approach can be easily extended for

several specific necessities, e.g. processing the hands of two or more persons.

    9 ACKNOWLEDGEMENT

    This work is supported by the Deutsche Forschungs-

    gemeinschaft (DFG) under grant number DD 20 9 11.

    REFERENCES

    1. Y. Cui, J. Weng, 1996, Int. Conf on Pattern

    Recognition, 617-621.

    2. K. Imagawa, S. Lu, S. Igi, 1998, Int. Conf. on

Automatic Face and Gesture Recognition, 462-467.

    3. D. Guo, Y. Yan, M. Xie, 1998, Int. Conf. on

    Control, Automation and Computer Vision.

    4. X. Zhu, J. Yang, A. Waibel, 2000, Int. Conf.

    Autom. Face Gesture Recognition, 446-453.

    5. T. Starner, B. Leibe, D. Minnen, T. Westyn, A.

    Hurst, J. Weeks, 2003, Machine Vision and

    Applications, Vol. 14(1), 59-71.

    6. K. Dorfmueller-Ulhaas, D. Schmalstieg, 2001,

    ACM/IEEE Int. Symp. on Augmented Reality, 30-

    44.

    7. Y. Sato, Y. Kobayashi, H. Koike, 2000, Int. Conf.

on Automatic Face and Gesture Recognition, 462-467.

    8. S. Malassiotis, F. Tsalakanidou, N. Mavridis, V.

    Giagourta, N. Grammalidis, M.G. Strintzis, 2001,

    Int. Conf. on Image Processing, 955-958.

    9. C. Jennings, 1999, Int. Workshop on Recognition,

    Analysis and Tracking of Faces and Gestures in

    Real-Time Systems, 152-160.

10. B.C. Lovell, D. Heckenberg, 2002, Asian Conference on Computer Vision, 336-341.

    11. R. Herpers, W. J. MacLean, C. Pantofaru, L. Wood,

K. Derpanis, D. Topalovic, J. Tsotsos, 2001, Int. Workshop on Recognition, Analysis and Tracking

    of Faces and Gestures in Real-Time Systems, 133-

    144.

12. P. Kauff, O. Schreer, 2002, IEEE Conf. on Multimedia and Expo.

    13. B. J. Lei, E. A. Hendriks, 2001, Vision, Modeling

    and Visualization, 185-192.

14. O. Schreer, N. Brandenburg, S. Askar, P. Kauff, 2001, Vision, Modeling and Visualization, 383-

    390.

    15. R. R. Anderson, J. Hu, J. A. Parrish, 1981,

    Bioengineering and the Skin, chapter 28, 253-265.

16. M. Störring, H.J. Andersen, E. Granum, 1999,

    Symp. on Intelligent Robotics Systems, 187-195.