Hand Gesture Recognition using Input–Output Hidden Markov Models



Sebastien Marcel, Olivier Bernier, Jean-Emmanuel Viallet and Daniel Collobert
France Telecom CNET
2 avenue Pierre Marzin
22307 Lannion, FRANCE
{sebastien.marcel, olivier.bernier, jeanemmanuel.viallet, daniel.collobert}@cnet.francetelecom.fr

    Abstract

A new hand gesture recognition method based on Input-Output Hidden Markov Models is presented. This method deals with the dynamic aspects of gestures. Gestures are extracted from a sequence of video images by tracking the skin-color blobs corresponding to the hand into a body-face space centered on the face of the user. Our goal is to recognize two classes of gestures: deictic and symbolic.

    1. Introduction

Person detection and analysis is a challenging problem in computer vision for human-computer interaction. LISTEN is a real-time computer vision system which detects and tracks a face in a sequence of video images coming from a camera. In this system, faces are detected by a modular neural network in skin color zones [3]. In [5], we developed a gesture-based LISTEN system integrating skin-color blobs, face detection and hand posture recognition. Hand postures are detected using neural networks in a body-face space centered on the face of the user. Our goal is to supply the system with a gesture recognition kernel in order to detect the intention of the user to execute a command. This paper describes a new approach for hand gesture recognition based on Input-Output Hidden Markov Models.

Input-Output Hidden Markov Models (IOHMM) were introduced by Bengio and Frasconi [1] for learning problems involving sequentially structured data. They are similar to hidden Markov models, but map input sequences to output sequences. Indeed, for many training problems the data are of a sequential nature, and multilayer neural networks (MLP) are often not adapted because they lack a memory mechanism to retain past information. Some neural network models capture temporal relations by using time delays in their connections (Time Delay Neural Networks) [11]. However, the temporal relations are then fixed a priori by the network architecture and not by the data themselves, which generally have temporal windows of variable input size.

Recurrent neural networks (RNN) model the dynamics of a system by capturing contextual information from one observation to another. Supervised training for RNN is primarily based on gradient descent methods: Back-Propagation Through Time [9], Real Time Recurrent Learning [13] and Local Feedback Recurrent Learning [7]. However, training with gradient descent is difficult when the duration of the temporal dependencies is large. Previous work on alternative training algorithms [2], such as Input/Output Hidden Markov Models, suggests that the root of the problem lies in the essentially discrete nature of the process of storing contextual information for an indefinite amount of time.

    2. Image Processing

We are working on image sequences in CIF format (384x288 pixels). In such images, we are interested in face detection and hand gesture recognition. Consequently, we must segment faces and hands from the image.

    2.1. Face and hand segmentation

We filter the image using a fast lookup indexing table of skin color pixels in YUV color space. After filtering, skin color pixels (Figure 1) are gathered into blobs [14]. Blobs (Figure 2) are statistical objects based on the location (x, y) and the colorimetry (Y, U, V) of the skin color pixels, used to determine homogeneous areas. A skin color pixel belongs to the blob with the same location and colorimetry components.

Figure 1. Skin color pixels after filtering.

Figure 2. Skin color blobs.
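
For illustration, the lookup-table filtering can be sketched in Python/NumPy as follows. The rectangular chrominance bounds are illustrative assumptions; they stand in for whatever skin-color table the system actually uses.

import numpy as np

# Illustrative chrominance bounds for skin color (assumed values,
# not those of the actual system).
U_MIN, U_MAX = 80, 130
V_MIN, V_MAX = 136, 200

# Precompute a 256x256 boolean table indexed by (U, V).
u, v = np.meshgrid(np.arange(256), np.arange(256), indexing="ij")
SKIN_LUT = (U_MIN <= u) & (u <= U_MAX) & (V_MIN <= v) & (v <= V_MAX)

def skin_mask(yuv_image):
    """Return a boolean mask of skin color pixels for an HxWx3 uint8 YUV image."""
    return SKIN_LUT[yuv_image[..., 1], yuv_image[..., 2]]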

    2.2. Extracting gestures

We map over the user a body-face space based on a discrete space for hand location [6], centered on the face of the user as detected by LISTEN. The body-face space is built using an anthropometric body model expressed as a function of the total height of the user, itself calculated from the face height. Blobs are tracked in the body-face space. The 2D trajectory of the hand blob¹ during a gesture is called a gesture path.
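
As an illustration, a gesture path can be assembled from the tracked hand-blob centers as follows; normalizing by the face height is an assumed convention standing in for the full anthropometric body model.

# Sketch: hand-blob centers (one per frame) mapped into a face-centered,
# face-height-normalized body-face space.
def gesture_path(blob_centers, face_center, face_height):
    fx, fy = face_center
    return [((x - fx) / face_height, (y - fy) / face_height)
            for (x, y) in blob_centers]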

    3. Hand Gesture Recognition

Numerous methods for hand gesture recognition have been proposed: neural networks (NN), such as recurrent models [8], hidden Markov models (HMM) [10] or gesture eigenspaces [12]. On one hand, HMM make it possible to compute the probability that observations could be generated by the model. On the other hand, RNN achieve good classification performance by capturing the temporal relations from one observation to another.

¹ Center of gravity of the blob corresponding to the hand.

However, they cannot compute the likelihood of observations. In this paper, we use IOHMM, which combine HMM properties with NN discrimination efficiency.

Figure 3. Deictic and symbolic gesture paths in the body-face space (normalized X-Y coordinates).

Our goal is to recognize two classes of gestures: deictic and symbolic gestures (Figure 3). Deictic gestures are pointing movements towards the left (right) of the body-face space, and symbolic gestures are intended to execute commands (grasp, click, rotate) on the left (right) of the shoulders. A video corpus was built using several persons executing these two classes of gestures several times. A database of gesture paths was obtained by manual video indexing and automatic blob tracking.

4. Input-Output Hidden Markov Models

The aim of IOHMM is to propagate, backward in time, targets in a discrete space of states, rather than the derivatives of the errors as in NN. The training is simplified and only has to learn the outputs and the next state defining the dynamic behavior.

    4.1. Architecture and modeling

The architecture of an IOHMM consists of a set of states, where each state $j$ is associated with a state neural network $\mathcal{N}_j$ and with an output neural network $\mathcal{O}_j$, where $x_t$ is the input vector at time $t$. A state network $\mathcal{N}_j$ has a number of outputs equal to the number of states. Each of these outputs gives the probability of transition from state $j$ to a new state.
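
A minimal sketch of the two kinds of subnetworks, assuming one-hidden-layer MLPs with a softmax over states for $\mathcal{N}_j$ and a sigmoid output for $\mathcal{O}_j$ (the topology and sizes are illustrative choices; the paper does not fix them):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class StateNetwork:
    """N_j: maps x_t to transition probabilities P(q_t = i | q_{t-1} = j, x_t)."""
    def __init__(self, n_in, n_states, n_hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0.0, 0.1, (n_states, n_hidden))
    def __call__(self, x):
        return softmax(self.W2 @ np.tanh(self.W1 @ x))  # sums to 1 over states

class OutputNetwork:
    """O_j: maps x_t to the expected output eta_{j,t} = E[y_t | q_t = j, x_t]."""
    def __init__(self, n_in, n_out, n_hidden=8, seed=1):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
    def __call__(self, x):
        return 1.0 / (1.0 + np.exp(-(self.W2 @ np.tanh(self.W1 @ x))))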

    4.2. Modeling

Let $x_1^T = x_1 \ldots x_T$ be the input sequence (observation sequence) and $y_1^T = y_1 \ldots y_T$ the output sequence. $x_t$ is the input vector ($x_t \in \mathbb{R}^n$) with $n$ the input vector size, and $y_t$ is the output vector ($y_t \in \mathbb{R}^m$) with $m$ the output vector size. $P$ is the number of input/output sequences and $T$ is the length of the observed sequence. The set of input/output sequences is defined by $\Omega = \{(x_1^T, y_1^T)_p\}$, with $p = 1 \ldots P$. The IOHMM model is described as follows: $q_t$ is the state of the model at time $t$, where $q_t \in \{1 \ldots N\}$ and $N$ is the number of states of the model; $S_j$ is the set of successor states for state $j$; $F$ is the set of final states.

The dynamic of the model is defined by:

$q_t = f(q_{t-1}, x_t), \qquad y_t = g(q_t, x_t) \qquad (1)$

$\theta_j$ is the set of parameters of the state network $\mathcal{N}_j$ ($j = 1 \ldots N$), where $\varphi_{j,t} = (\varphi_{1j,t}, \ldots, \varphi_{Nj,t})$ is the output of the state network $\mathcal{N}_j$ at time $t$, with the relation $\varphi_{ij,t} = P(q_t = i \mid q_{t-1} = j, x_t)$, i.e. the probability of transition from state $j$ to state $i$, with $\sum_{i=1}^{N} \varphi_{ij,t} = 1$. $\vartheta_j$ is the set of parameters of the output network $\mathcal{O}_j$ ($j = 1 \ldots N$), where $\eta_{j,t}$ is the output of the output network $\mathcal{O}_j$ at time $t$, with the relation $\eta_{j,t} = E[y_t \mid q_t = j, x_t]$. Let us introduce the following variables in the model:

$\zeta_t$: memory of the system at time $t$, $\zeta_t \in [0, 1]^N$, with $\zeta_{i,t} = \sum_{j=1}^{N} \zeta_{j,t-1}\, \varphi_{ij,t}$ for $t \geq 1$, where $\zeta_{j,t} = P(q_t = j \mid x_1^t)$, and $\zeta_0$ is randomly chosen with $\sum_{j=1}^{N} \zeta_{j,0} = 1$;

$\eta^g_t$: global output of the system at time $t$, $\eta^g_t \in \mathbb{R}^m$:

$\eta^g_t = \sum_{j=1}^{N} \zeta_{j,t}\, \eta_{j,t} \qquad (2)$

with the relation $\eta^g_t = E[y_t \mid x_1^t]$, i.e. the probability of the expected output $y_t$ given the input sequence $x_1^t$;

$f(y_t; \eta_{j,t})$: probability density function (pdf) of outputs, where $f(y_t; \eta_{j,t}) = P(y_t \mid q_t = j, x_t)$, i.e. the probability of the expected output $y_t$ given the current input vector $x_t$ and the current state $q_t = j$.

We formulate the problem of the training as a problem of maximization of the probability function of the set of parameters of the model on the set of training sequences. The likelihood of input/output sequences (Equation 3) is, as in HMM, the probability that a finite observation sequence could be generated by the IOHMM.

$L(\theta; \Omega) = P(\Omega \mid \theta) = \prod_{p=1}^{P} P(y_1^T \mid x_1^T) \qquad (3)$

where $\theta$ is the parameter vector given by the concatenation of $\{\vartheta_j\}$ and $\{\theta_j\}$. We introduce the EM algorithm as an iterative method to estimate the maximum of the likelihood.
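
The memory recurrence and the global output (Equation 2) translate directly into a forward pass; a sketch, reusing the StateNetwork and OutputNetwork sketches of Section 4.1:

import numpy as np

def forward(state_nets, output_nets, xs, zeta0):
    """xs: input sequence x_1..x_T; zeta0: initial state distribution."""
    N = len(state_nets)
    zeta, outputs = zeta0, []
    for x in xs:
        phi = np.stack([state_nets[j](x) for j in range(N)], axis=1)  # phi[i, j]
        zeta = phi @ zeta           # zeta_{i,t} = sum_j phi_{ij,t} zeta_{j,t-1}
        eta = np.stack([output_nets[j](x) for j in range(N)])         # eta_{j,t}
        outputs.append(zeta @ eta)  # global output eta^g_t, Eq. (2)
    return np.array(outputs), zeta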

    4.3. The EM algorithm

The goal of the EM (Expectation-Maximization) algorithm [4] is to maximize the log-likelihood function (Equation 4) on the parameters $\theta$ of the model given the data $\Omega$.

$\ell(\theta; \Omega) = \log L(\theta; \Omega) \qquad (4)$

To simplify this problem, the EM assumption is to introduce a new set of parameters $X$, known as the hidden set of parameters. Thus, we obtain a new set of data $\Omega_c = (\Omega, X)$, called the complete set of the data, of log-likelihood function $\ell(\theta; \Omega_c)$. However, this function cannot be maximized directly because $X$ is unknown. It was already shown [4] that the iterative estimation of the auxiliary function $Q$ (Equation 5), using the parameters $\hat\theta$ of the previous iteration, maximizes $\ell(\theta; \Omega_c)$.

$Q(\theta; \hat\theta) = E[\ell(\theta; \Omega_c) \mid \Omega, \hat\theta] \qquad (5)$

Computing $Q$ corresponds to supplementing the missing data by using knowledge of the observed data and of the previous parameters. The EM algorithm is the following, for $k = 1 \ldots K$, until a local maximum is reached:

Estimation step: computation of $Q(\theta; \theta^{(k-1)}) = E[\ell(\theta; \Omega_c) \mid \Omega, \theta^{(k-1)}]$

Maximization step: $\theta^{(k)} = \arg\max_\theta Q(\theta; \theta^{(k-1)})$

Analytical maximization is done by cancelling the partial derivatives: $\partial Q(\theta; \hat\theta) / \partial \theta = 0$.
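
In code, the iteration has the following schematic shape (estep and mstep stand for the computations detailed in the next subsection):

def em(theta_hat, estep, mstep, n_iterations=20):
    for _ in range(n_iterations):
        stats = estep(theta_hat)    # E-step: expectations defining Q(theta, theta_hat)
        theta_hat = mstep(stats)    # M-step: theta maximizing Q
    return theta_hat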

    4.4. Training IOHMM using EM

Let $S$ be the set of state sequences, $S = \{(q_1^T)_p\}$ with $p = 1 \ldots P$; the complete data set is:

$\Omega_c = \{(x_1^T, y_1^T, q_1^T)_p, \; p = 1 \ldots P\}$

and the likelihood on $\Omega_c$ is:

$L(\theta; \Omega_c) = P(\Omega_c \mid \theta) = \prod_{p=1}^{P} P(y_1^T, q_1^T \mid x_1^T)$

For convenience, we choose to omit the $p$ variable in order to simplify the notation. Furthermore, the conditional dependency of the variables of the system (Equation 1) allows us to write the above likelihood as:

$L(\theta; \Omega_c) = \prod_{p=1}^{P} \prod_{t=1}^{T} P(y_t, q_t \mid q_{t-1}, x_t)$

Let us introduce the variable $z_{j,t}$:

$z_{j,t} = \begin{cases} 1 & \text{if } q_t = j \\ 0 & \text{if } q_t \neq j \end{cases}$

The log-likelihood is then:

$\ell(\theta; \Omega_c) = \log L(\theta; \Omega_c) = \sum_{p=1}^{P} \sum_{t=1}^{T} \left[ \sum_{j=1}^{N} z_{j,t} \log P(y_t \mid q_t = j, x_t) + \sum_{i=1}^{N} \sum_{j=1}^{N} z_{i,t}\, z_{j,t-1} \log P(q_t = i \mid q_{t-1} = j, x_t) \right]$

However, the set of state sequences is unknown, and $\ell(\theta; \Omega_c)$ cannot be maximized directly. The auxiliary function $Q$ must be computed (Equation 5):

$Q(\theta; \hat\theta) = E[\ell(\theta; \Omega_c) \mid \Omega, \hat\theta] = \sum_{p=1}^{P} \sum_{t=1}^{T} \left[ \sum_{j=1}^{N} \hat g_{j,t} \log f(y_t; \eta_{j,t}) + \sum_{i=1}^{N} \sum_{j=1}^{N} \hat h_{ij,t} \log \varphi_{ij,t} \right]$

where $\hat h_{ij,t}$ is computed using $\hat\theta$ as follows:

$h_{ij,t} = P(q_t = i, q_{t-1} = j \mid x_1^T, y_1^T) = \frac{\alpha_{j,t-1}\, \varphi_{ij,t}\, \beta_{i,t}\, f(y_t; \eta_{i,t})}{L}$

and $L = P(y_1^T \mid x_1^T)$; similarly, $\hat g_{j,t} = P(q_t = j \mid x_1^T, y_1^T) = \alpha_{j,t}\, \beta_{j,t} / L$. $\alpha_{j,t}$ and $\beta_{j,t}$ are computed (see [1] for details) using equations (6) and (7).

$\alpha_{i,t} = P(y_1^t, q_t = i \mid x_1^t) = f(y_t; \eta_{i,t}) \sum_{j=1}^{N} \varphi_{ij,t}\, \alpha_{j,t-1} \qquad (6)$

$\beta_{i,t} = P(y_{t+1}^T \mid q_t = i, x_{t+1}^T) = \sum_{j=1}^{N} f(y_{t+1}; \eta_{j,t+1})\, \varphi_{ji,t+1}\, \beta_{j,t+1} \qquad (7)$

Then $L$ is given by:

$L = P(y_1^T \mid x_1^T) = \sum_{i \in F} P(y_1^T, q_T = i \mid x_1^T) = \sum_{i \in F} \alpha_{i,T}$
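
Equations (6) and (7) and the final sum can be sketched as follows, assuming precomputed arrays phi[t, i, j] = P(q_t = i | q_{t-1} = j, x_t) and b[t, i] = f(y_t; eta_{i,t}); for simplicity, the sketch sums over all states rather than only the final states F.

import numpy as np

def forward_backward(phi, b, zeta0):
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = b[0] * (phi[0] @ zeta0)                     # Eq. (6), first step
    for t in range(1, T):
        alpha[t] = b[t] * (phi[t] @ alpha[t - 1])          # Eq. (6)
    for t in range(T - 2, -1, -1):
        beta[t] = phi[t + 1].T @ (b[t + 1] * beta[t + 1])  # Eq. (7)
    L = alpha[-1].sum()  # likelihood L = sum_i alpha_{i,T}
    return alpha, beta, L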

The learning algorithm is as follows: for each sequence $(x_1^T, y_1^T)$ and for each state $j = 1 \ldots N$, we compute $\varphi_{j,t}$ and $\eta_{j,t}$, then $\alpha_{j,t}$, $\beta_{j,t}$ and $\hat h_{ij,t}$ ($\forall i$). Then we adjust the parameters $\theta_j$ of the state networks $\mathcal{N}_j$ to maximize equation (8).

$\sum_{p=1}^{P} \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \hat h_{ij,t} \log \varphi_{ij,t} \qquad (8)$

We also adjust the parameters $\vartheta_j$ of the output networks $\mathcal{O}_j$ to maximize equation (9).

$\sum_{p=1}^{P} \sum_{t=1}^{T} \sum_{j=1}^{N} \hat g_{j,t} \log f(y_t; \eta_{j,t}) \qquad (9)$
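
The posteriors entering equations (8) and (9) follow directly from alpha, beta and L; a sketch consistent with the expressions above:

import numpy as np

def posteriors(alpha, beta, phi, b, L):
    T, N = alpha.shape
    g = alpha * beta / L  # g_{j,t} = P(q_t = j | x_1^T, y_1^T)
    h = np.zeros((T, N, N))
    for t in range(1, T):
        # h_{ij,t} = alpha_{j,t-1} phi_{ij,t} beta_{i,t} f(y_t; eta_{i,t}) / L
        h[t] = (beta[t] * b[t])[:, None] * phi[t] * alpha[t - 1][None, :] / L
    return g, h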

Let $\theta_j^{(k)}$ be the $k$-th parameter of the state network $\mathcal{N}_j$. The partial derivatives of equation (8) are given by:

$\frac{\partial Q(\theta; \hat\theta)}{\partial \theta_j^{(k)}} = \sum_{p=1}^{P} \sum_{t=1}^{T} \sum_{i=1}^{N} \frac{\hat h_{ij,t}}{\varphi_{ij,t}} \frac{\partial \varphi_{ij,t}}{\partial \theta_j^{(k)}}$

where the partial derivatives $\partial \varphi_{ij,t} / \partial \theta_j^{(k)}$ are computed using classic backpropagation in the state network $\mathcal{N}_j$.

Let $\vartheta_j^{(k)}$ be the $k$-th parameter of the output network $\mathcal{O}_j$. The partial derivatives of equation (9) are given by:

$\frac{\partial Q(\theta; \hat\theta)}{\partial \vartheta_j^{(k)}} = \sum_{p=1}^{P} \sum_{t=1}^{T} \hat g_{j,t} \frac{\partial \log f(y_t; \eta_{j,t})}{\partial \vartheta_j^{(k)}} = \sum_{p=1}^{P} \sum_{t=1}^{T} \hat g_{j,t} \frac{\partial \log f(y_t; \eta_{j,t})}{\partial \eta_{j,t}} \frac{\partial \eta_{j,t}}{\partial \vartheta_j^{(k)}}$

As before, the partial derivatives $\partial \eta_{j,t} / \partial \vartheta_j^{(k)}$ can be computed by backpropagation in the output networks $\mathcal{O}_j$. The pdf $f(y_t; \eta_{j,t})$ depends on the problem.

    4.5. Applying IOHMM to gesture recognition

We want to discriminate a deictic gesture from a symbolic gesture. Gesture paths are sequences of $[\Delta t, x_t, y_t]$ observations, where $x_t$, $y_t$ are the coordinates at time $t$ and $\Delta t$ is the sampling interval. Therefore, the input size is $n = 3$ and the output size is $m = 1$. We choose to learn $y_t = 1$ as output for deictic gestures and $y_t = 0$ as output for symbolic gestures.

Furthermore, we assume that the pdf of the model is $f(y_t; \eta_{j,t}) = e^{-\frac{1}{2} \sum_{k=1}^{m} (y_{k,t} - \eta_{kj,t})^2}$, i.e. an exponential Mean Square Error. Then, the partial derivatives of equation (9) become:

$\frac{\partial Q(\theta; \hat\theta)}{\partial \vartheta_j^{(k)}} = \sum_{p=1}^{P} \sum_{t=1}^{T} \hat g_{j,t} \sum_{k=1}^{m} (y_{k,t} - \eta_{kj,t}) \frac{\partial \eta_{kj,t}}{\partial \vartheta_j^{(k)}}$
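
A minimal sketch of this pdf:

import numpy as np

def pdf(y, eta):
    """f(y; eta) = exp(-0.5 * sum_k (y_k - eta_k)^2)."""
    d = np.asarray(y, dtype=float) - np.asarray(eta, dtype=float)
    return float(np.exp(-0.5 * np.dot(d, d)))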

Our gesture database (Table 1) is divided into three subsets: the learning set, the validation set and the test set. The learning set is used for training the IOHMM, the validation set is used to tune the model, and the test set is used to evaluate the performance. Table 1 indicates in the first column the number of sequences. The second, third and fourth columns respectively indicate the minimum, the mean and the maximum number of observations.

Table 1. The gesture path database.

5. Results

Unfortunately, untrained gestures, i.e. the deictic and symbolic retraction gestures, cannot be classified either by the output of the MLP based on interpolated gestures or by the global output of the IOHMM (Figure 5).

Figure 6. Likelihood cue for trained and untrained gestures.

Nevertheless, it is possible to estimate, for the IOHMM, a likelihood cue that can be used to distinguish trained gestures from untrained gestures (Figure 6). This likelihood cue can be computed in an HMM way by adding to each state of the model an observation probability of the input $x_t$.
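
A sketch of such a cue, assuming a per-state input density p(x_t | q_t = j), for instance a Gaussian fitted on training data (the paper does not specify the density), accumulated with the same forward recursion:

import numpy as np

def input_likelihood(phi, p_in, zeta0):
    """p_in[t, j]: assumed observation probability of the input x_t in state j."""
    T, N = p_in.shape
    a = zeta0 * p_in[0]
    for t in range(1, T):
        a = p_in[t] * (phi[t] @ a)  # transition, then input emission
    return a.sum()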

    6. Conclusion

A new hand gesture recognition method based on Input/Output Hidden Markov Models is presented. IOHMM deal with the dynamic aspects of gestures. They combine Hidden Markov Model properties with Neural Network discrimination efficiency. When trained gestures are encountered, the classification is as powerful as the neural network used. The IOHMM uses the current observation only, and not a temporal window fixed a priori. Furthermore, when untrained gestures are encountered, the likelihood cue is more discriminant than the global output.

Future work is in progress to integrate hand gesture recognition based on IOHMM into the LISTEN based system. The full system will integrate face detection, hand posture recognition and hand gesture recognition.

    References

[1] Y. Bengio and P. Frasconi. An Input/Output HMM architecture. In Advances in Neural Information Processing Systems, pages 427-434, 1995.

[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[3] M. Collobert, R. Feraud, G. LeTourneur, O. Bernier, J. Viallet, Y. Mahieux, and D. Collobert. LISTEN: A system for locating and tracking individual speakers. In 2nd Int. Conf. on Automatic Face and Gesture Recognition, pages 283-288, 1996.

[4] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977.

[5] S. Marcel. Hand posture recognition in a body-face centered space. In CHI '99, pages 302-303, 1999. Extended Abstracts.

[6] D. McNeill. Hand and Mind: What gestures reveal about thought. University of Chicago Press, 1992.

[7] M. Mozer. A focused backpropagation algorithm for temporal pattern recognition. Complex Systems, 3:349-381, 1989.

[8] K. Murakami and H. Taguchi. Gesture recognition using recurrent neural networks. In Conference on Human Interaction, pages 237-242, 1991.

[9] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, Cambridge, 1986.

[10] T. Starner and A. Pentland. Visual recognition of American Sign Language using Hidden Markov Models. In Int. Conf. on Automatic Face and Gesture Recognition, pages 189-194, 1995.

[11] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:328-339, 1989.

[12] T. Watanabe and M. Yachida. Real-time gesture recognition using eigenspace from multi input image sequences. In Int. Conf. on Automatic Face and Gesture Recognition, pages 428-433, 1998.

[13] R. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280, 1989.

[14] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780-785, 1997.