
    VISION-BASED SKIN-COLOUR SEGMENTATION OF MOVING HANDS

    FOR REAL-TIME APPLICATIONS

    S. Askar, Y. Kondratyuk, K. Elazouzi, P. Kauff, O. Schreer

    Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut

    Germany

    ABSTRACT

    We present a robust vision-based skin-colour

    segmentation method for moving hands in a real-time

    application. Segmentation of hands is an important

    processing step in gesture recognition applications,

    where the general shape and position of the hands are of

    interest. In contrast to these approaches, the presented

method concentrates on an accurate segmentation, which is required for further processing steps in a real-

    time videoconferencing application. A hand tracking

    procedure is applied to improve the segmentation in

    terms of accuracy, robustness and processing speed.

Furthermore, the presented approach can cope with

    difficult situations like contact between hands or contact

    between face and hands. This is important for many

    real-time applications, e.g. for the presented

    videoconference system to allow the conferees a natural

    behaviour. Moreover we present an approach for an

    automatic initialisation of the skin-colour range to the

    specific user. We show experimental results proving the

efficiency and reliability of our approach. The proposed hand segmentation method is capable of processing TV-

    sized (CCIR 601, 576x720 pixels) video images in real-

    time with 25 Hz on a common PC. The presented

    approach will support any video processing in visual

    media production, where segmentation accuracy and

    real-time capability is required.

    1 INTRODUCTION

    Numerous applications use skin-colour as one of the

basic features for detecting or analysing the human face or hands. They have different aims and different

    constraints under which the human face or hands are

    being analysed. One crucial point, which is common for

    most of the applications in this context, is an accurate

segmentation of the human face or hands. Many applications deal with the segmentation of hands, such as

    hand sign recognition, human vehicle interaction,

human-computer interfaces; but common to all of them is a

rough segmentation result, as other features are derived from it

(refer to Cui (1), Imagawa (2), Guo (3), Zhu (4), Starner (5)). In some hand segmentation approaches marked

    gloves are used, which are not applicable in video

conferencing systems (see Dorfmueller (6)). In other approaches infrared cameras are used or depth

information based on multiple views is exploited, e.g. Sato

    (7), Malassiotis (8), Jennings (9). The real-time

constraint is considered as well, but only in gesture

recognition applications, e.g. in Lovell (10), Herpers

    (11). The combination of accurate segmentation of both

    hands, robust tracking including overlap of hands and

    head, and real-time capability on high resolution video

    has not been considered in any publication before.

    The presented method on segmentation of hands using

    skin-colour is a resulting work of a project on

immersive 3D video conferencing, which is being developed at FhG/HHI (see Kauff (12)). In this system,

    segmentation of hands is an important part to improve

    different succeeding processing steps such as disparity

    estimation or synthesis of virtual views.

    The approach is a new robust segmentation method for

    moving hands, which can also handle contact between

    hands and contact between hands and head. The

    fundamental algorithm is based on skin-colour

segmentation and uses so-called bounding boxes, which

track the hands and head separately (Fig. 1). A spatial sub-

    sampling in the considered bounding boxes guarantees a

    more robust and additionally a fast segmentation.

Furthermore, we present a new initialisation algorithm, which determines the specific skin-colour values

    automatically based on the first few images. This is very

    important in order to limit the skin colour range to the

    specific person and to achieve robustness and accuracy.

    Our algorithms process TV-sized video images (CCIR

    601, (576x720) pixels) in real-time with 25 Hz.

Fig. 1: Hand contours and bounding boxes

    In contrast to many approaches in the field of e.g.

gesture recognition or controlling movements, we are not interested in features of the hands like the orientation of

the hands or their motion. The key problem in our

    application is to find a closed region, which coincides as

    much as possible with the real contour of the hand to

    specify depth discontinuities. In the proposed method,

an initialisation of the specific skin colour range is followed by the real-time segmentation, which consists

    of two succeeding steps: the tracking of hands and the


    skin-colour based segmentation (see Fig. 2). The

    tracking of hands is performed on a sub-sampled QCIF

    image, whereas in the second step a region growing

approach is applied to the full resolution image to extract the final hand segments.

[Fig. 2 block diagram: initialisation (determination of the skin-colour range), hand tracking on the sub-sampled image, skin-colour segmentation on the original image size]

Fig. 2: Block diagram of the presented method

    In the next section, the concept of immersive 3D

    videoconferencing is described and the relevance of

    accurate segmentation of hands in this context is

explained. Then, the automatic initialisation of the specific skin-colour range of the participant is

presented. In sections 4 and 5, the tracking of hands and

    the skin colour segmentation are proposed. In section 6,

a solution for the case of overlapping skin-coloured

    regions is presented. Experimental results are shown in

    section 7. The paper ends with a conclusion.

    2 IMMERSIVE 3D-VIDEOCONFERENCING

The basic idea of immersive 3D videoconferencing is that the participants perceive the virtual

conference scene under the correct perspective. This

    includes full eye contact with the remote participants,

    although the cameras are mounted around the display. It

is achieved by a synthesis of virtual views in order to simulate a virtual camera at the correct position on the

    display (see Lei (13)). The current demonstrator of the

    immersive 3D videoconferencing system is shown in

    Fig. 3.

To achieve the correct perspective view of the remote participants, a real-time capable disparity estimator has been developed, calculating depth information from

    stereo camera images (see Schreer (14)). Although this

    disparity estimator provides convincing results, it fails

    at depth discontinuities in occluded areas, where pixel

correspondences cannot be calculated. This leads to

artefacts in the synthesized views. Due to the nature of the videoconferencing application, depth discontinuities

    mainly occur at the contours of the free gesticulating

hands. Since artefacts in these areas severely

disturb the impression of immersiveness and

    natural representation of the remote participants, a

    reduction of these effects is desired.

Fig. 3: Demonstrator of the immersive 3D

    videoconferencing system

Accurate segmentation masks of both hands provide

very helpful information for improving the disparity estimation by replacing wrong or unknown

    disparities with reliable values (see Schreer (14)). The

colour of human hands is a striking feature that offers a

    solution to this problem. Segmentation of skin-colour

can provide information about the depth discontinuity at the contour area of the hands.

    3 INITIALISATION OF SKIN-COLOUR

Usually, TV and video data are available in the YUV

colour space. Hence, the investigations in this approach have been made in the YUV colour space. Other colour

    spaces (e.g. HSV, HSI) aim to provide a more uniform

and accurate representation of colour, interpreting it in

the same way as the human perceptual system does. But the transformation from video signals to these colour

spaces is a very time-consuming process and needs to be

avoided in real-time applications. In our proposed

algorithm we consider only the chrominance channels

(U, V), which fully represent the colour. The

skin-colour is described as a quadruple consisting of

the mean values m_u, m_v and the tolerance values t_u, t_v.
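
To make this concrete, a minimal Python/NumPy sketch of the classification rule follows; the box-shaped decision region comes directly from the quadruple above, while the function name and the array interface are our own assumptions.

```python
import numpy as np

def skin_mask(u: np.ndarray, v: np.ndarray,
              m_u: float, m_v: float,
              t_u: float, t_v: float) -> np.ndarray:
    # A pixel is skin-coloured if both chrominance values lie
    # within the tolerances around the mean values.
    return (np.abs(u - m_u) <= t_u) & (np.abs(v - m_v) <= t_v)
```

Applied to the U and V planes of a YUV image, this yields the binary skin mask used in the following steps.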

The spectral reflectance of human skin is independent

of the human race and of the wavelength of the exposed light (see Anderson (15)). The same

    observation can be made considering the transformed

    colour in common video formats. Hence, the human

    skin-colour can be defined as a global skin-colour

cloud in the colour space (see Störring (16)). The

    general thresholds for this striking area are still too large

    to obtain reasonable segmentation results. Depending on

    different factors like shadows, illumination, colour

    distribution in the particular video data, different

pigmentation of the person's skin and so on, it is useful

    to adapt thresholds to the given illumination conditions

    and the observed person. Therefore, it is assumed that

the skin-colour of a human under certain conditions can be considered as a subset of a global skin-colour

    cloud. Hence we distinguish between the following

  • 8/10/2019 vision based skin color segmentation

    3/7

    two terms: 1) global skin-colour, representing skin-

    colour in a general way with large tolerance values, and

    2) skin-colour, representing the skin-colour for the

specific person under certain illumination conditions, described with specific mean values and reduced

tolerances.

Hence, an important question arises: how to determine

    appropriate skin-colour parameters for a scenario to

achieve the best segmentation results. Applying parameters

from a general statistical analysis of skin-colour does not lead to optimal segmentation results in the majority of

cases. But they can often be used as good coarse start

    values to find appropriate parameters by slightly

    varying them.

One option is to adapt the thresholds manually at the beginning of the segmentation. This is obviously not

convenient in terms of usability and user friendliness

in the case of a video conferencing system. Therefore, a

    quasi-automatic method is presented to find suitable

parameters. Nevertheless, in the case of extremely dark or extremely bright illumination an additional manual adjustment is unavoidable. However, we experienced

that for brighter illumination it is reasonable to choose

    larger tolerance values than for dark cases.

    The initialisation step is performed in the sub-sampled

image for real-time and stability purposes. Besides the

desired skin-colour range, it also provides the three centres of gravity of the two hands and the head, which are used

    as start positions for the bounding boxes. In the first

    image, a pixelwise skin-colour segmentation is

    performed. For the initial skin colour range, threshold

    values are obtained from statistical analysis of a number

of images representing the global skin-colour cloud. After applying the global thresholds, a rough binary

    mask is obtained, which is filtered to reduce noise.

    The goal of the following process is to determine the

    blob position of both hands and the head and to

    calculate new and more accurate skin-colour threshold

    values in the distinct area. Hence, the row and column

histograms of the binary image are calculated, which

represent the skin-coloured pixel distribution in

    horizontal and vertical direction (Fig. 4).

Fig. 4: Row and column histograms of the binary image

The image is now divided into three equal stripes in horizontal and vertical direction, which leads to nine

    equal areas in the whole image. For each stripe the

    maximum in the corresponding histogram interval is

    determined (Fig. 5). The points of intersection of the

    horizontal and vertical maxima yield nine potential

positions of the centres of gravity of possible hand or head blobs (Fig. 6). Obviously, some of them have to be

    wrong.

    A neighbourhood analysis searching for the points with

    the most skin-coloured neighbour pixels removes wrong

    points. The resulting three positions mark the three skin-

    colour blobs: left hand, right hand and face (Fig. 7).

Fig. 5: Determination of the maximum in each stripe

Fig. 6: Points of intersection based on horizontal and

    vertical maxima

Fig. 7: Resulting blob positions
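
A sketch of this blob localisation in Python is given below; the stripe splitting and the intersection points follow the description above, whereas the neighbourhood window size and the exact scoring of the intersection points are assumptions, since the paper does not give these details.

```python
import numpy as np

def locate_blobs(mask: np.ndarray, win: int = 5):
    """Find the three centres of gravity (left hand, right hand, face)
    in a binary skin mask using row and column histograms."""
    rows, cols = mask.shape
    row_hist = mask.sum(axis=1)   # skin pixels per image row
    col_hist = mask.sum(axis=0)   # skin pixels per image column

    # Maximum of the histogram within each of three equal stripes.
    def stripe_peaks(hist, n):
        third = n // 3
        return [s * third + int(np.argmax(hist[s * third:(s + 1) * third]))
                for s in range(3)]

    row_peaks = stripe_peaks(row_hist, rows)
    col_peaks = stripe_peaks(col_hist, cols)

    # Nine intersection points; keep the three with the most
    # skin-coloured pixels in their neighbourhood.
    def support(r, c):
        return int(mask[max(r - win, 0):r + win + 1,
                        max(c - win, 0):c + win + 1].sum())

    candidates = [(r, c) for r in row_peaks for c in col_peaks]
    return sorted(candidates, key=lambda p: support(*p), reverse=True)[:3]
```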

The proposed histogram method is independent of the

orientation of the camera. Obviously, this approach works reliably if the hands and head are in different

image regions; they do not, however, need to be at specific positions.

  • 8/10/2019 vision based skin color segmentation

    4/7

    In order to distinguish between the face and the hands

the assumption was made that the object on the top is

    related to the face. In the case of larger rotations of the

camera, this information must be taken into account to assign the blob position of the face correctly (e.g.

leftmost object, rightmost object, ...). In summary, a

    few rules have to be considered by the observed person,

    but very simple and general ones. The experiments have

proven that the correct blob positions are computed

reliably after a few frames. In our online real-time application, the initialisation will

be repeated until three separate and reliable blob

    positions are determined. After successful computation

of the blob positions, the skin-coloured pixels around

them are analysed, and the specific new mean values and tolerances are derived. After delivery of the

    initialisation parameters, the vision-based hand

    segmentation starts. During the hand segmentation

    process the bounding boxes are kept under surveillance.

Some hints, e.g. no segmented pixels in a box or boxes leaving the image area, lead to re-initialisation. In these cases we assume either a failure of the tracking

process or a change of the environmental conditions

    (shadows, illumination change).
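
The paper states that new mean values and tolerances are derived from the skin-coloured pixels around the blob positions, but not how; the sketch below assumes a simple choice, namely the per-channel mean and a multiple of the standard deviation.

```python
import numpy as np

def estimate_skin_params(u, v, blob_mask, k=2.5):
    # u, v: chrominance planes; blob_mask: boolean mask selecting the
    # skin-coloured pixels around the detected blobs.
    # The factor k scaling the standard deviation is an assumption;
    # the paper only states that reduced, person-specific tolerances
    # replace the large global ones.
    us, vs = u[blob_mask], v[blob_mask]
    m_u, m_v = float(us.mean()), float(vs.mean())
    t_u, t_v = k * float(us.std()), k * float(vs.std())
    return m_u, m_v, t_u, t_v
```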

    4 TRACKING

An enormous optimisation in speed can be achieved if

    the segmentation of the hands is limited to a specific

    region. This is very important for real-time applications.

Therefore we define two bounding boxes, which track the hands of the conferee continuously during the

    conference session and the search is performed just

    inside these boxes. After a pixelwise skin-colour

    segmentation inside each bounding box, we calculate

    the centre of gravity of the obtained skin-colour area in

    the whole box. The purpose of the tracking phase is to

    determine the centres of gravity of the hands using the

    previous bounding box area. Then a new bounding box

    position is calculated and delivered to the succeeding

segmentation step. In Fig. 8, left, the calculation of the

new centre of gravity inside the old box is depicted. The

box shifted to the new position is shown in Fig. 8, right.


Fig. 8: Tracking of the centre of gravity
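
One tracking iteration can be sketched as follows; the (top, left, height, width) box representation and the None signal for an empty box are hypothetical details, not taken from the paper.

```python
import numpy as np

def track_box(mask, box):
    # mask: pixelwise skin segmentation of the sub-sampled image.
    # box:  (top, left, height, width) of the old bounding box.
    top, left, h, w = box
    window = mask[top:top + h, left:left + w]
    ys, xs = np.nonzero(window)
    if ys.size == 0:
        return None  # no skin pixels in the box: a hint for re-initialisation
    # Centre of gravity of the skin-coloured area inside the old box ...
    cy, cx = top + int(ys.mean()), left + int(xs.mean())
    # ... becomes the centre of the new bounding box.
    return (cy - h // 2, cx - w // 2, h, w)
```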

Moreover, we perform the whole tracking process in the

    sub-sampled image to achieve further reduction of

    processing time. As in the tracking step only a blob-

tracking is performed, it is sufficient to apply this procedure in the sub-sampled image. The usage of

    bounding boxes and the tracking in the sub-sampled

image have much more impact besides real-time

capability, as the segmentation becomes extremely

robust. Pixels spatially far away from hands and head,

but coloured similarly to skin, do not have any influence on the segmentation result. In addition, the

tracking process in the sub-sampled image acts as a filter as

well and thus leads to a reduction of noise in

    the blob positions.

    5 SKIN-COLOUR SEGMENTATION

For the succeeding accurate segmentation of the hands in the high-resolution image, a skin-colour segmentation method based on a region growing technique has been

developed. The region growing approach requires a so-

    called seed point for the segmentation, which is

    provided by the previous tracking process. Starting from

    this point the segmented area is enlarged by analysing

continuously the neighbours of the segmented pixels. The advantage of this technique is that it leads to one

    closed region.
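
A minimal sketch of such a region growing step is shown below; the 4-connectivity and the breadth-first traversal are our assumptions, as the paper only requires that one closed region is grown from the seed point.

```python
from collections import deque
import numpy as np

def grow_region(is_skin, seed, shape):
    # is_skin(y, x): skin-colour test for one full-resolution pixel;
    # assumed to return False outside the image.
    # seed: starting point provided by the tracking step.
    region = np.zeros(shape, dtype=bool)
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        if not (0 <= y < shape[0] and 0 <= x < shape[1]):
            continue
        if region[y, x] or not is_skin(y, x):
            continue
        region[y, x] = True
        # Enlarge the segmented area by the 4-neighbours of this pixel.
        queue.extend([(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)])
    return region
```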

    The region growing approach accounts for the case that

    the gravity point obtained from the tracking step does

    not have skin-colour, e.g. because it lies between two

fingers or because hand boxes overlap and mislead tracking. Additionally, the situation is taken into account in which

contact between hands and face occurs. Both cases

    are discussed in the following section.

Fig. 9: Region growing

    6 CONTACT OF SKIN-COLOURED AREAS

In order to allow the participants of the conference session a natural behaviour with free gestures, contact

    between both hands and also contact between hands and

face has to be considered. This is done twice: in the tracking step as well as in the segmentation step. If the

  • 8/10/2019 vision based skin color segmentation

    5/7

    hands are very close to each other the following

    problem occurs in the tracking phase. For example, if

    the left hand box also detects a part of the right hand in

the box (see Fig. 10, left), then tracking could be misled. Due to the segmented parts of the other hand,

    the centre of gravity could be shifted to a wrong

    position with no skin-colour.

    In the worst case, the search for a skin-coloured region

    in the neighbourhood of the determined non-skin-

coloured centre of gravity will lead to the wrong right hand, and the left hand might be lost. In order to avoid

    this case, the search directions, starting from the

determined centre of gravity, are limited to the side

opposite the position of the right hand box, and vice versa (Fig.

10, right). A fast and sophisticated retrieval strategy preserves the correct hand.


Fig. 10: Contact of hand boxes
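
The retrieval strategy itself is not detailed in the paper; the sketch below merely illustrates the idea of limiting the search directions to the side facing away from the other hand's box. The step pattern and the search length are assumptions.

```python
def find_seed_away_from(is_skin, start, other_centre, max_steps=20):
    # start: (possibly non-skin) centre of gravity of this hand's box.
    # other_centre: centre of the other hand's box; search only away from it.
    # is_skin(y, x) is assumed to return False outside the image.
    y, x = start
    dy = -1 if other_centre[0] > y else 1
    dx = -1 if other_centre[1] > x else 1
    for step in range(max_steps):
        for cand in ((y + step * dy, x),
                     (y, x + step * dx),
                     (y + step * dy, x + step * dx)):
            if is_skin(*cand):
                return cand
    return None  # no skin found: hand considered lost, re-initialise
```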

If both hands are in contact with each other, then the

bounding boxes overlap. If the hands come apart, the

bounding boxes obviously must be separated, which is again not trivial. To overcome this problem, the following approach has been implemented to separate

    the bounding boxes, when the hands get separated. For

each bounding box, favourite directions are defined, e.g.

the left-bottom edge for one box and the right-top edge for the

    other. While the hands are in contact, the boxes are just

allowed to move in the preferred directions. If the hands are not connected any more, the preference gets

    switched off and the movement of the bounding boxes

    is not limited further on. Example images of a sequence

    are presented in the next section.

    If a hand has contact with the face, the following is

performed: in addition to the hands, the head blob of the participant resulting from the initialisation phase is

    tracked as well in the sub-sampled image, using a third

    bounding box. If one of the hand boxes overlaps with

    the head box, then only the non-overlapping part of the

    hand box is considered for tracking the centre of

gravity. This is shown in Fig. 11, left. Thus, a wrong movement of the hand box towards the

    head is avoided and tracking becomes much more

robust without losing the hand. Otherwise, if a hand

    disappears completely inside the head box, an area

    surrounding the head box is observed waiting for the

hand to come out of the head box (see Fig. 11, right). If skin-coloured pixels are detected in the surrounding area,

the tracking of the hand is continued.


Fig. 11: Contact of hand and head box
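
The exclusion of the overlapping part when tracking the hand's centre of gravity can be sketched as follows, reusing the hypothetical (top, left, height, width) box representation of the earlier sketches.

```python
import numpy as np

def centroid_excluding_head(mask, hand_box, head_box):
    t, l, h, w = hand_box
    window = mask[t:t + h, l:l + w].copy()
    ht, hl, hh, hw = head_box
    # Zero out the part of the hand box that overlaps the head box,
    # so the centre of gravity cannot drift towards the face.
    it, ib = max(t, ht), min(t + h, ht + hh)
    il, ir = max(l, hl), min(l + w, hl + hw)
    if it < ib and il < ir:
        window[it - t:ib - t, il - l:ir - l] = False
    ys, xs = np.nonzero(window)
    if ys.size == 0:
        return None  # hand fully inside the head box: watch the surrounding area
    return t + int(ys.mean()), l + int(xs.mean())
```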

    7 EXPERIMENTAL RESULTS

The presented methods run on a standard PC

(Pentium IV, 2 GHz) in real-time on full TV-resolution

video (576x720 pixels at 25 Hz). Hence, all situations,

    such as different behaviour and gestures, have been

tested under real conditions. The following extracted images of a sequence will show the robustness in

    several situations and the accuracy of the segmentation.

In Fig. 12, an example is given where the hands contact

each other and come apart. After the contact, tracking is

    still successful and the bounding boxes can be separated

    correctly.

In Fig. 13, a misleading tracking is shown. In this case,

the right hand box gets lost after contact of the

hand with the face. Instead, the face of the person is

    wrongly tracked. The successful operation of our

method is shown in Fig. 14 and Fig. 15 for situations where a single hand, and also both hands, have contact

    with the face region.

Fig. 12: Contact of hands

  • 8/10/2019 vision based skin color segmentation

    6/7

Fig. 13: Contact of hand and head box, hand box is lost

Fig. 14: Contact between hand and head, correct tracking

Fig. 15: Contact of both hands and head together, correct tracking (order: left to right)

The image series (Fig. 14) shows that the right hand

box still tracks the hand correctly after the contact, using the head box processing method. Despite the robust

tracking, it must be noted that with our algorithm it is

    not possible to determine the contours of the objects

    while they are connected. Only if they are separated, our

    application makes use of the contours determined in the

    single boxes.

Finally, Fig. 15 gives an example where both hands touch the head at the same time. After separation, each

box correctly tracks the corresponding object.

Actually, some assumptions for our skin-colour

segmentation method have been made:

- no sudden change of illumination,

- long sleeves on the clothes worn,

- normal motion speed of the hands while gesticulating.

Minor changes in the illumination can be handled

easily, as in every new image the actual skin-coloured pixels

are determined. Based on these pixels, new thresholds can be derived.

    The restriction to long sleeves is mainly determined by

the size of the bounding boxes. A larger bounding box increases the computational effort, which may result

    in a lower frame rate under certain circumstances.

    Nevertheless, experiments have successfully shown a

segmentation of participants wearing T-shirts. The speed of moving hands may cause a misleading

tracking. But in online tests, it turned out that hands have

    to be moved quite fast, which is not expected as a

    normal gesticulating behaviour.

    8 CONCLUSION

    In this paper a new robust method for accurate

    segmentation of hands has been presented running

successfully in a real-time application, processing TV-sized images (576x720 pixels, 25 Hz). A new method was

    proposed to adjust the thresholds for skin-colour

    segmentation automatically according to the specific

    participant and the illumination conditions. The required

region of interest of skin-coloured pixels is determined quite robustly using a new histogram technique. The

    segmentation is performed in bounding boxes

surrounding the hands in order to reduce the computational effort. These boxes are tracked

    continuously, whereas the method is able to handle

  • 8/10/2019 vision based skin color segmentation

    7/7

    contact between both hands and contact between hands

and face without losing the tracked objects, namely

    the hands. A continued analysis strategy controls

tracking and segmentation and provides accurate hand masks. Besides videoconferencing, other applications are

    kept in mind for successful use of the presented

    methods, such as advanced gesture recognition tools,

post-production using 3D or photo-realistic rendering.

    The presented approach can be easily extended for

several specific necessities, e.g. processing the hands of two or more persons.

    9 ACKNOWLEDGEMENT

    This work is supported by the Deutsche Forschungs-

    gemeinschaft (DFG) under grant number DD 20 9 11.

    REFERENCES

    1. Y. Cui, J. Weng, 1996, Int. Conf on Pattern

    Recognition, 617-621.

    2. K. Imagawa, S. Lu, S. Igi, 1998, Int. Conf. on

Automatic Face and Gesture Recognition, 462-467.

    3. D. Guo, Y. Yan, M. Xie, 1998, Int. Conf. on

    Control, Automation and Computer Vision.

    4. X. Zhu, J. Yang, A. Waibel, 2000, Int. Conf.

    Autom. Face Gesture Recognition, 446-453.

    5. T. Starner, B. Leibe, D. Minnen, T. Westyn, A.

    Hurst, J. Weeks, 2003, Machine Vision and

    Applications, Vol. 14(1), 59-71.

    6. K. Dorfmueller-Ulhaas, D. Schmalstieg, 2001,

    ACM/IEEE Int. Symp. on Augmented Reality, 30-

    44.

    7. Y. Sato, Y. Kobayashi, H. Koike, 2000, Int. Conf.

on Automatic Face and Gesture Recognition, 462-467.

    8. S. Malassiotis, F. Tsalakanidou, N. Mavridis, V.

    Giagourta, N. Grammalidis, M.G. Strintzis, 2001,

    Int. Conf. on Image Processing, 955-958.

    9. C. Jennings, 1999, Int. Workshop on Recognition,

    Analysis and Tracking of Faces and Gestures in

    Real-Time Systems, 152-160.

10. B.C. Lovell, D. Heckenberg, 2002, Asian Conference on Computer Vision, 336-341.

    11. R. Herpers, W. J. MacLean, C. Pantofaru, L. Wood,

K. Derpanis, D. Topalovic, J. Tsotsos, 2001, Int. Workshop on Recognition, Analysis and Tracking

    of Faces and Gestures in Real-Time Systems, 133-

    144.

12. P. Kauff, O. Schreer, 2002, IEEE Conf. on Multimedia and Expo.

    13. B. J. Lei, E. A. Hendriks, 2001, Vision, Modeling

    and Visualization, 185-192.

14. O. Schreer, N. Brandenburg, S. Askar, P. Kauff, 2001, Vision, Modeling and Visualization, 383-

    390.

    15. R. R. Anderson, J. Hu, J. A. Parrish, 1981,

    Bioengineering and the Skin, chapter 28, 253-265.

16. M. Störring, H.J. Andersen, E. Granum, 1999,

    Symp. on Intelligent Robotics Systems, 187-195.