
    Gesture Recognition for Virtual Reality Applications

    Using Data Gloves and Neural Networks

    John Weissmann, Department of Computer Science, University of Zurich, [email protected],

    Ralf Salomon, Department of Computer Science, University of Zurich, [email protected]

Abstract

This paper explores the use of hand gestures as a means of human-computer interaction for virtual reality applications. For the application, specific hand gestures, such as fist, index finger, and victory sign, have been defined. Most existing approaches use various camera-based recognition systems, which are rather costly and very sensitive to environmental changes.

    In contrast, this paper explores a data glove as the input

    device, which provides 18 measurement values for the

    angles of different finger joints. This paper compares the

    performance of different neural network models, such as

    back-propagation and radial-basis functions, which are

    used by the recognition system to recognize the actual

    gesture.

    Some network models achieve a recognition rate (training

    as well as generalization) of up to 100% over a number of

    test subjects. Due to its good performance, this

recognition system is the first step towards virtual reality

    applications in which program execution is controlled by

    a sign language.

Introduction

Currently, interactions with virtual reality (VR) applications are done in a simple way. Even when sophisticated devices such as space balls, 3D mice or data gloves are present, they are mainly used as a means for pointing and grabbing, i.e. the same I/O paradigm as is used with 2D mice. However, it has been shown [1], for example, that experienced users work more efficiently with word processors when using keyboard shortcuts than with the mouse. Generalising this observation to 3 dimensions, our aim was to move away from the simple point-and-click paradigm to a more compact way of interaction. Therefore, we explore how hand gestures could be used to interact with VR applications in the form of a simple sign language.

In gesture recognition, it is more common to use a camera in combination with an image recognition system [2]. These systems have the disadvantage that the image/gesture recognition is very sensitive to illumination, hand position, hand orientation, etc. In order to circumvent these problems, we decided to use a data glove as the input device.

Problem Description

The problem we faced was to find a way to map a set of

    angular measurements as delivered by the data glove to a

    set of pre-defined hand gestures. Furthermore, it would be

    advantageous to have a system with a certain amount of

    flexibility, so that the same system could be used by

    different people.

Methods

In our experiments, we used the CyberGlove, distributed

    by Virtual Technologies Inc. [3], which measures the

    angles of 18 joints of the hand: two for each finger, one

    each for the angles between neighbouring fingers, as well

    as one each for thumb rotation, palm arch, wrist pitch,

    and wrist yaw. To design and train the neural networks

    we used the Stuttgart Neural Network Simulator [4], a

    free software package. SNNS also provides a tool which

    can convert a trained network to a C-code module which

    can subsequently be included in an application.

For our experiments we chose a set of 20 static hand gestures such as fist, index finger, gun, and

    victory sign. Accordingly, each neural network model

    had 18 input and 20 output nodes. The experiments were

    performed with three standard three-layered back-

    propagation networks using the logistic function

$f(\mathrm{net}_i) = 1 / (1 + \exp(-\mathrm{net}_i))$

at each layer $l$, with

$\mathrm{net}_i = \sum_j w_{ij}\, o_j^{l-1},$

where $o_j^{l-1}$ denotes the output of the units of the previous layer.

Learning was performed with a constant learning rate of 0.2. For more information on back-propagation, see [5] or [6].
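For illustration, the following Python sketch (a simplification; the actual system used networks trained in SNNS and exported as C modules) shows the forward pass of such a fully connected three-layer network with the logistic activation, mapping the 18 glove measurements to 20 gesture outputs. The NumPy implementation, layer sizes and random weights are only illustrative.

```python
import numpy as np

def logistic(net):
    """Logistic activation f(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + np.exp(-net))

def forward(angles, w_hidden, w_output):
    """Forward pass of a three-layer back-propagation network.

    angles:   the 18 joint-angle measurements from the data glove
    w_hidden: hidden-layer weights, shape (30, 18) as in BPfull
    w_output: output-layer weights, shape (20, 30), one row per gesture
    """
    hidden = logistic(w_hidden @ angles)   # net_i = sum_j w_ij * o_j of the previous layer
    return logistic(w_output @ hidden)     # 20 output activations, one per gesture

# Untrained example with random weights; a real run would use the learned weights.
rng = np.random.default_rng(0)
outputs = forward(rng.uniform(0.0, 90.0, size=18),
                  rng.normal(scale=0.1, size=(30, 18)),
                  rng.normal(scale=0.1, size=(20, 30)))
print(outputs.shape)  # (20,)
```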

    We collected a pattern set of 200 hand gestures from one

    person which we divided into a training set of 140

    patterns and a test set of 60 patterns.

    The structure of these networks can be described as

    follows:

    (i) Network BPfull : all hidden units (30 units) are fully

    connected to all input units

    (ii) Network BPpair : each hidden unit is connected to the

    input units corresponding to the measurements of two

    fingers ("finger pairs"). Since we treat the measurements

    of thumb rotation, palm arch, wrist pitch and wrist yaw as

measurements of a sixth finger, this amounts to 15 units in the hidden layer.

(iii) Network BPtriple: each hidden unit is connected to the

    input units corresponding to the measurements of finger

    triples, which again leads to 15 hidden units.

    The idea behind the architectures of BPpair and BPtriple is

    to exploit a (tentative) correlation between gestures and

    finger combinations.

    In all networks, all hidden units are fully connected to all

    output units, each of which is responsible for recognising

    a particular gesture (see Fig. 1).

Fig. 1: Structure of the finger pair network (input groups such as Hand and Wrist, Middle finger, and Thumb; outputs such as Fist, Index, Gun, and OK). The nodes on the input layer are grouped by fingers. Each node of the hidden layer receives its input from exactly two finger node groups. Each output node receives its input from all nodes of the hidden layer. For clarity, not all nodes and connections are shown.
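To make the sparse connectivity of BPpair concrete, the following sketch builds a binary connection mask with one hidden unit per pair of finger groups. The assignment of the 18 sensor indices to the six groups below is an assumption for illustration (the paper does not list which sensor belongs to which group); only the overall scheme of 6 groups and 15 pairwise hidden units follows the description above.

```python
import numpy as np
from itertools import combinations

# Assumed grouping of the 18 sensor indices into six "fingers"; the hand/wrist
# group stands for thumb rotation, palm arch, wrist pitch and wrist yaw.
FINGER_GROUPS = {
    "thumb":  [0, 1],
    "index":  [2, 3, 10],
    "middle": [4, 5, 11],
    "ring":   [6, 7, 12],
    "pinky":  [8, 9, 13],
    "hand":   [14, 15, 16, 17],
}

def pair_connectivity_mask(groups, n_inputs=18):
    """One hidden unit per pair of finger groups (6 choose 2 = 15),
    connected only to the sensors of those two groups."""
    pairs = list(combinations(groups, 2))
    mask = np.zeros((len(pairs), n_inputs))
    for k, (a, b) in enumerate(pairs):
        mask[k, groups[a] + groups[b]] = 1.0
    return mask

mask = pair_connectivity_mask(FINGER_GROUPS)
print(mask.shape)  # (15, 18): multiply the hidden-layer weights by this mask
                   # so that absent connections stay at zero during training.
```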

The recognised gesture is determined in a winner-takes-all fashion if at least one output unit exceeds the (experimentally determined) threshold value of 0.8; otherwise the pattern is classified as unknown.
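A minimal sketch of this decision rule (the gesture names and the four-element example vectors are placeholders; the real network has 20 outputs):

```python
import numpy as np

GESTURES = ["fist", "index finger", "gun", "victory sign"]  # ... 20 gestures in total
THRESHOLD = 0.8  # experimentally determined threshold

def classify(outputs, names=GESTURES, threshold=THRESHOLD):
    """Winner-takes-all: the most active output wins, but only if it
    exceeds the threshold; otherwise the pattern is 'unknown'."""
    winner = int(np.argmax(outputs))
    return names[winner] if outputs[winner] >= threshold else "unknown"

print(classify(np.array([0.05, 0.93, 0.10, 0.02])))  # -> index finger
print(classify(np.array([0.40, 0.55, 0.30, 0.20])))  # -> unknown
```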

First Results

The first network, BPfull, performed quite poorly (< 10%), whereas BPpair and BPtriple yielded high recognition rates of 99.5% and 92.0%, respectively, on the test set.

    If a gesture recognition system is to be used in a

    productive way, it must be flexible enough so that

    different people can use it without having to go through a

    tedious data collection and training session. Obviously,

    the particular recognition rate depends significantly on the

    test person's hand geometry.

    To get a better idea of the generalisation capabilities of

    such networks, we took training and test sets from 5

    different persons. Again, all of the training sets consisted

    of 140 patterns. In a first experiment we trained 5

    networks (based on the finger pair structure) with the 5

    training sets and checked the recognition rate of each

    network on each of the 5 test sets. For this and the

    following experiments we restricted ourselves to the

    network architecture BPpair. The results are shown in the

    following table:

Table 1:

             Net A   Net B   Net C   Net D   Net E
Test Set A    1.00    0.82    0.98    0.95    0.67
Test Set B    0.92    1.00    0.90    0.88    0.80
Test Set C    0.87    0.93    0.98    0.97    0.77
Test Set D    0.85    0.90    0.88    1.00    0.67
Test Set E    0.78    0.77    0.75    0.77    0.98

As can be seen, the recognition rate of each net on its own test set is practically 100%, the exceptions being Net C and Net D. The recognition rate on the other persons' test sets varies strongly between 67% and 98%; in most cases it is higher than 85%. These results seemed to indicate the possibility of training a net in such a way that the gestures of any person will be recognised with an acceptable accuracy.

Combined Training Sets

In the next experiment we merged several combinations of the original 5 training sets into new training sets. The following table shows the recognition rates of five networks trained with combinations of 4 training sets each. In this table, Net A denotes a net which has been trained with a combination of the training sets from persons B, C, D, and E, but not A.

Table 2:

             Net A   Net B   Net C   Net D   Net E
Test Set A    1.00    1.00    1.00    1.00    1.00
Test Set B    0.98    0.98    1.00    0.98    1.00
Test Set C    1.00    1.00    1.00    1.00    1.00
Test Set D    1.00    0.98    0.98    0.97    1.00
Test Set E    1.00    1.00    1.00    1.00    0.88

    A further net, trained with a combination of all five

    training sets, scored extremely well on the test sets. With

    the exception of test set B, for which the recognition rate

    was 98.3%, it showed a 100% recognition rate.

Of course, we are aware that the data set we used is too small to permit significant statements about such a net's performance for all possible hand geometries. However, we believe the results achieved so far are encouraging. Nevertheless, it is conceivable that the combined net cannot cope with the gestures of a user whose hand geometry differs radically from those used to create the training sets. Therefore, it would be interesting to look at systems whose parameters could be changed at runtime.

Radial Basis Functions

Radial-basis function (RBF) networks consist of an input and an output layer in which each output unit is fully connected to all input units. Each output unit $o_j$ maintains an $N$-dimensional vector $\vec{c}_j$ (with $N$ representing the number of input units), which represents the centre of a Gaussian bump. Each output unit first calculates the distance

$d_j = \sum_{i=1}^{N} (c_i^j - u_i)^2$

of its centre to the current point of the input activation, denoted by the $u_i$'s. It then determines its activation

$\mathrm{act}(o_j) = \exp(-d_j / \sigma)$

with $\sigma$ denoting a scaling factor. Further details on RBF networks can be found in [6].
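A minimal Python sketch of this output computation, assuming one Gaussian centre per gesture class and the reconstructed activation formula above (the function name and array layout are illustrative):

```python
import numpy as np

def rbf_activations(u, centres, sigma=1.0):
    """Activations of the RBF output layer.

    u:       current input vector (the 18 glove measurements)
    centres: array of shape (num_gestures, 18), one Gaussian centre per output unit
    sigma:   scaling factor (1.0 in the experiments reported below)
    """
    d = np.sum((centres - u) ** 2, axis=1)  # squared distance of each centre to the input
    return np.exp(-d / sigma)               # Gaussian bump around each centre
```

The recognised gesture could then be selected from these activations in the same winner-takes-all fashion as for the back-propagation networks.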

    For our experiments with radial basis function systems we

    employed the same training and test sets as were used for

the back-propagation networks. As the scaling factor we used the value 1.0. In the next table the recognition rates

    of 5 simple RBFs (i.e. RBFs trained with the training set

    of one person each) are shown:

Table 3:

             RBF A   RBF B   RBF C   RBF D   RBF E
Test Set A    0.98    0.35    0.53    0.65    0.33
Test Set B    0.53    0.95    0.51    0.60    0.35
Test Set C    0.70    0.48    0.98    0.55    0.40
Test Set D    0.66    0.53    0.53    1.00    0.46
Test Set E    0.45    0.31    0.43    0.53    0.98

It can be seen that the generalisation capabilities of the simple RBFs are somewhat inferior to those of the back-propagation networks trained on single-person training sets.

    However, by training RBFs with combinations of training

    sets, we can achieve generalisation capabilities similar to

    those of the back-propagation networks trained with

combined training sets. As was the case in Table 2, RBF A denotes an RBF whose training set is a combination of the training sets B, C, D, and E, but not A.

Table 4:

             RBF A   RBF B   RBF C   RBF D   RBF E
Test Set A    0.91    0.99    1.00    0.99    1.00
Test Set B    0.98    0.86    0.99    0.99    0.99
Test Set C    0.99    1.00    0.94    0.99    0.99
Test Set D    0.99    0.99    0.99    0.97    0.98
Test Set E    0.96    0.93    0.96    0.96    0.72

    The advantage of employing RBFs lies in the fact that

    RBFs can be easily retrained at run time due to their

    linear character. This means a gesture recognition system

    based on RBFs could be adaptively retrained if it

    encounters a user whose hand geometry differs strongly

    from those in the training sets.
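The paper does not specify an update rule, but one simple possibility for such run-time adaptation is to nudge the centre of the intended gesture's output unit towards freshly collected glove samples, as in the following hypothetical sketch:

```python
import numpy as np

def adapt_centre(centres, gesture_index, new_sample, rate=0.1):
    """Move the centre of one output unit a small step towards a newly
    collected measurement of the gesture it is supposed to recognise."""
    centres[gesture_index] += rate * (np.asarray(new_sample) - centres[gesture_index])
    return centres
```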

Applications and Future Work

In order to demonstrate the usability of a sign language as a means of controlling a program, we incorporated our gesture recognition system into a simple virtual reality application. This application consists of some objects in a 3-dimensional space and a robot hand (see Fig. 2). We assigned simple commands such as move robot hand forward, rotate robot hand about the x-axis, or grab object to some of the gestures. With a small learning effort, it is possible to effectively navigate in the virtual world and manipulate objects therein.


Fig. 2: The Test Application. The gesture-controlled

    robot hand is just about to grab an object in virtual

    space.

    We are currently working on the integration of our system

    as a means of interaction in a number of virtual reality

    applications developed at the University of Zurich, such as

    a virtual endoscopy application and a geographical

    information system.

    In the future, we are planning to continue our work in the

    following directions:

- Exploiting the adaptive possibilities of RBF-based systems: This would enable run-time changes to the system, and thus retraining the system to the gestures of a new user whose gestures are not yet recognised.

- Recognition of dynamic gestures: Gestures such as waving or wagging a finger can make a sign language much more intuitive. In order to correctly recognize dynamic gestures, the data glove must be equipped with a tracking device such as the Ascension Flock of Birds [7] or the Polhemus Fastrak [8], in order to provide the system with positional and orientational information.

- Use of both hands: In VR applications where a particular gesture of the right hand, such as extended index finger, is assigned the command move forward, gestures of the left hand could be used as modifiers to regulate the speed.

- Recognition of gesture sequences: Here the problem lies in detecting and eliminating unwanted intermediate gestures. If, for instance, the gesture thumbs up is followed by the gesture extended index finger, the gesture gun (extended index finger plus thumb) might unintentionally be formed during the transition.

The application of a gesture recognition system as described in this paper need not be restricted to VR programs; once the points mentioned above have been solved, it would, for example, also open up the possibility of building a system for the translation of ASL (American Sign Language) into spoken English.

Conclusion

This paper demonstrates that the chosen combination of data glove and neural networks achieves high recognition rates on a set of predefined gestures. Therefore, it can be considered a first step towards VR applications or other types of applications in which program execution is controlled by means of a sign language.

Acknowledgements

This work is supported in part by the Swiss National Science Foundation, grant #21-50684.97.

References

[1] G. d'Ydewalle et al., Graphical versus Character-Based Word Processors: An Analysis of User Performance, Behaviour and Information Technology, 1995, v.14, n.4, p.208-214.

[2] R. Kjeldsen, J. Kender, Toward the Use of Gesture in Traditional User Interfaces, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996, p.151-156.

[3] Virtual Technologies Inc., Palo Alto, CA 94306. Production and distribution of data gloves and related devices. www.virtex.com

[4] Web site for the Stuttgart Neural Network Simulator: www-ra.informatik.uni-tuebingen.de/SNNS/

[5] J. Hertz, A. Krogh, R. Palmer, Introduction to the Theory of Neural Computation, Santa Fe Institute Studies in the Sciences of Complexity, Lecture Notes v.1, Addison-Wesley.

[6] R. Rojas, Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin (1996).

[7] Ascension Technology Corporation, www.ascension-tech.com

[8] Polhemus Incorporated, www.polhemus.com