YELLAPU MADHURI
AUTOMATIC LANGUAGE TRANSLATION SOFTWARE FOR AIDING COMMUNICATION BETWEEN INDIAN SIGN
LANGUAGE AND SPOKEN ENGLISH USING LABVIEW
A PROJECT REPORT
Submitted in partial fulfillment for the award of the degree of
MASTER OF TECHNOLOGY
in
BIOMEDICAL ENGINEERING
by
YELLAPU MADHURI (1651110002)
Under the Guidance of
Ms. G. ANITHA, (Assistant Professor)
DEPARTMENT OF BIOMEDICAL ENGINEERING
SCHOOL OF BIOENGINEERING FACULTY OF ENGINEERING & TECHNOLOGY
SRM UNIVERSITY (Under Section 3 of UGC Act, 1956) SRM Nagar, Kattankulathur-603203
Tamil Nadu, India
MAY 2013
ACKNOWLEDGEMENT
First and foremost, I express my heartfelt and deep sense of gratitude to our Chancellor
Shri. T. R. Pachamuthu, Shri. P. Ravi Chairman of the SRM Group of Educational Institutions,
Prof. P.Sathyanarayanan, President, SRM University, Dr. R.Shivakumar, Vice President, SRM
University, Dr. M.Ponnavaikko, Vice Chancellor, for providing me the necessary facilities for the
completion of my project. I also acknowledge Registrar Dr. N. Sethuraman for his constant support
and endorsement.
I wish to express my sincere gratitude to our Director (Engineering & Technology) Dr. C.
Muthamizhchelvan for his constant support and encouragement.
I am extremely grateful to the Head of the Department Dr. M. Anburajan for his invaluable
guidance, motivation, timely and insightful technical discussions. I am immensely grateful for his
constant encouragement and smooth approach throughout the project period, and for making this work
possible.
I am indebted to my Project Co-coordinators Mrs. U.Snekhalatha and Mrs. Varshini
Karthik for their valuable suggestions and motivation. I am deeply indebted to my Internal Guide
Ms. G. Anitha, and faculties of Department of Biomedical Engineering for extending their warm
support, constant encouragement and ideas they shared with us.
I would be failing in my duty if I did not acknowledge my family members and my friends
for their constant encouragement and support.
BONAFIDE CERTIFICATE
This is to certify that the Project entitled "AUTOMATIC LANGUAGE TRANSLATION
SOFTWARE FOR AIDING COMMUNICATION BETWEEN INDIAN SIGN LANGUAGE
AND SPOKEN ENGLISH USING LABVIEW" has been carried out by YELLAPU
MADHURI-1651110002 under the supervision of Ms. G. Anitha in partial fulfillment of the
degree of MASTER OF TECHNOLOGY in Biomedical Engineering, School of Bioengineering,
SRM University, during the academic year 2012-2013 (Project Work Phase-II, Semester-IV). The
contents of this report, in full or in parts, have not been submitted to any institute or university for
the award of any degree or diploma.
Signature                                      Signature
HEAD OF THE DEPARTMENT                         INTERNAL GUIDE
(Dr. M. Anburajan)                             (Ms. G. Anitha)
Department of Biomedical Engineering,          Department of Biomedical Engineering,
SRM University,                                SRM University,
Kattankulathur – 603 203.                      Kattankulathur – 603 203.

INTERNAL EXAMINER                              EXTERNAL EXAMINER
ABSTRACT
This report presents sign language translation software for the automatic translation
of Indian Sign Language into spoken English and vice versa, to assist communication between
speech- and/or hearing-impaired people and hearing people. It can be used by the Deaf community as
a translator when dealing with people who do not understand sign language, avoiding the
intervention of an intermediate person for interpretation and allowing communication in each
party's natural way of speaking. The proposed software is a standalone executable interactive
application developed using LabVIEW that can be implemented on any standard Windows
laptop or desktop, or on an iOS mobile phone, operating with the camera, processor and audio device.
For sign-to-speech translation, the one-handed sign gestures of the user are captured using the
camera; vision analysis functions are performed in the operating system and the
corresponding speech output is provided through the audio device. For speech-to-sign translation,
the speech input of the user is acquired by the microphone; speech analysis functions are performed
and the sign gesture picture corresponding to the speech input is displayed. The lag time
experienced during translation is small because of parallel processing, which allows
near-instantaneous translation from finger and hand movements to speech and from speech input to
sign language gestures. The system is trained to translate one-handed sign representations of
alphabets (A-Z) and numbers (1-9) to speech, and 165 word phrases to sign gestures. The training
database of inputs can be easily extended to expand the system's applications. The software does
not require the user to wear any special hand gloves. The results are found to be highly consistent
and reproducible, with fairly high precision and accuracy.
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.
ABSTRACT I
LIST OF FIGURES IV
LIST OF ABBREVIATIONS VI
1 INTRODUCTION 1
1.1 HEARING IMPAIRMENT 2
1.2 NEED FOR THE SYSTEM 7
1.3 AVAILABLE MODELS 8
1.4 PROBLEM DEFINITION 8
1.5 SCOPE OF THE PROJECT 9
1.6 FUTURE PROSPECTS 10
1.7 ORGANISATION OF REPORT 10
2 AIM AND OBJECTIVES
OF THE PROJECT 11
2.1 AIM 11
2.2 OBJECTIVES 11
3 MATERIALS AND METHODOLOGY 12
3.1 SIGN LANGUAGE TO SPOKEN 13
ENGLISH TRANSLATION
3.2 SPEECH TO SIGN LANGUAGE 21
TRANSLATOR
4 RESULTS AND DISCUSSIONS 32
4.1 RESULTS 32
4.2 DISCUSSIONS 43
5 CONCLUSIONS AND FUTURE ENHANCEMENTS 48
5.1 CONCLUSIONS 48
5.2 FUTURE ENHANCEMENT 49
REFERENCES 52
LIST OF FIGURES
FIGURE PAGE NO.
1.1 Anatomy of human ear 3
1.2 Events involved in hearing 3
1.3 Speech chain 4
1.4 Block diagram of speech chain 4
3.1 Graphical abstract 12
3.2 Flow diagram of template preparation 16
3.3 Flow diagram of pattern matching 19
3.4 Block diagram of sign to speech translation 21
3.5 Flow diagram of speech to sign translation 25
3.6 Block diagram of speech to sign translation 24
3.7 Speech recognizer tutorial window 25
4.1 Application Installer 32
4.2 Application window 33
4.3 GUI of Speech to Sign translation 34
4.4 Speech recognizer in sleep mode 36
4.5 Speech recognizer in active mode 36
4.6 Speech recognizer when input speech is not clear for recognition 36
4.7 GUI of working window of speech to sign translation 37
4.8 Block diagram of speech to sign translation 37
4.9 GUI of template preparation 38
4.10 Block diagram of sign to speech translation 38
4.11 GUI of working window of template preparation 39
4.12 GUI of sign to speech translation 40
4.13 GUI of working window of sign to speech translation 41
4.14 Block diagram of sign to speech translation 42
4.15 Block diagram of pattern matching 42
4.16 Data base of sign templates 46
4.17 Data base of sign number templates 47
LIST OF ABBREVIATIONS
Sr.No ABBREVIATION EXPANSION
1 SL Sign language
2 BII Bahasa Isyarat India
3 SLT Sign language translator
4 ASLR Automatic sign language recognition
5 ASLT Automatic sign language translation
6 GSL Greek Sign Language
7 SDK Software development kit
8 RGB Red green blue
9 USB Universal serial bus
10 CCD Charge-coupled device
11 ASL American sign language
12 ASR Automatic sign recognition
13 HMM Hidden Markov model
14 LM Language model
15 OOV Out of vocabulary
1. INTRODUCTION
In India there are around 60 million people with hearing deficiencies. Deafness brings about
significant communication problems: most deaf people have serious difficulty expressing
themselves in spoken or written language or understanding written texts. This fact can cause deaf people to
have problems accessing information, education, employment, social relationships, culture, etc. It is
necessary to distinguish between “deaf” and “Deaf”: the first refers to non-hearing
people, and the second one refers to non-hearing people who use a sign language to communicate
between themselves (their mother tongue), making them part of the “Deaf community”. Sign
language is a language through which communication is possible without the means of acoustic
sounds. Instead, sign language relies on sign patterns, i.e., body language, orientation and
movements of the arm to facilitate understanding between people. It exploits unique features of the
visual medium through spatial grammar. Sign languages are fully-fledged languages that have a
grammar and lexicon just like any spoken language, contrary to what most people think. The use of
sign languages defines the Deaf as a linguistic minority, with learning skills, cultural and group
rights similar to other minority language communities.
Hand gestures can be used for natural and intuitive human-computer interaction for
translating sign language to spoken language to assist communication of deaf community with non
sign language users. To achieve this goal, computers should be able to recognize hand gestures
from input. Vision-based gesture recognition can achieve an improved interaction, more intuitive
and flexible for the user. However, vision-based hand tracking and gesture recognition is an
extremely challenging problem due to the complexity of hand gestures, which are rich in diversity
owing to the many degrees of freedom of the human hand. At the same time, computer vision
algorithms are notoriously brittle and computation intensive, which make most current gesture
recognition systems fragile and inefficient. This report proposes a new architecture to solve the
problem of real-time vision-based hand tracking and gesture recognition. To recognize different
hand postures, a parallel cascades structure is implemented. This structure achieves real-time
performance and high translation accuracy. The 2D position of the hand is recovered according to
the camera’s perspective projection. To make the system robust against cluttered backgrounds,
background subtraction and noise removal are applied. The overall goal of this project is to develop
a new vision-based technology for recognizing and translating continuous sign language to spoken
English and vice-versa.
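As a simple illustration of the perspective-projection idea mentioned above, a pinhole-camera model maps a 3-D point in the camera frame to a 2-D image position. This is a hedged sketch only; the focal length and coordinates below are arbitrary assumptions, not parameters from this project.

```python
def project_to_image(x: float, y: float, z: float, f: float = 800.0):
    """Pinhole (perspective) projection of a camera-frame point (x, y, z),
    z > 0, onto the image plane: u = f*x/z, v = f*y/z (in pixels)."""
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (f * x / z, f * y / z)

# A hand centre 0.5 m to the right at 2 m depth maps to u = 800*0.5/2 = 200 px.
print(project_to_image(0.5, 0.0, 2.0))  # (200.0, 0.0)
```

The 2-D hand position recovered this way is what the recognition layer groups over time into gesture classes.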
1.1 HEARING IMPAIRMENT
Hearing is one of the major senses and is important for distant warning and communication.
It can be used to alert, to communicate pleasure and fear. It is a conscious appreciation of vibration
perceived as sound. In order to do this, the appropriate signal must reach the higher parts of the
brain. The function of the ear is to convert physical vibration into an encoded nervous impulse. It
can be thought of as a biological microphone. Like a microphone the ear is stimulated by vibration:
in the microphone the vibration is transduced into an electrical signal, in the ear into a nervous
impulse which in turn is then processed by the central auditory pathways of the brain. The
mechanism to achieve this is complex.
The ears are paired organs, one on each side of the head with the sense organ itself, which is
technically known as the cochlea, deeply buried within the temporal bones. Part of the ear is
concerned with conducting sound to the cochlea; the cochlea is concerned with transducing
vibration. The transduction is performed by delicate hair cells which, when stimulated, initiate a
nervous impulse. Because they are living, they are bathed in body fluid which provides them with
energy, nutrients and oxygen. Most sound is transmitted by a vibration of air. Vibration is poorly
transmitted at the interface between two media which differ greatly in characteristic impedance.
The ear has evolved a complex mechanism to overcome this impedance mismatch, known as the
sound conducting mechanism. The sound conducting mechanism is divided into two parts: an outer
part which catches sound, and the middle ear which acts as an impedance
matching device. Sound waves can be distinguished from each other by means of the differences in
their frequencies and amplitudes. For people suffering from any type of deafness, these differences
cease to exist. The anatomy of the ear and the events involved in hearing process are shown in
figure 1.1 and figure 1.2 respectively.
Figure 1.1 Anatomy of human ear
Figure 1.2 Events involved in hearing
Figure 1.3 Speech chain
Figure 1.4 Block diagram of speech chain
1.1.1 THE SPEECH SIGNAL
While you are producing speech sounds, the air flow from your lungs first passes the glottis
and then your throat and mouth. Depending on which speech sound you articulate, the speech
signal can be excited in three possible ways:
• VOICED EXCITATION
The glottis is closed. The air pressure forces the glottis to open and close periodically thus
generating a periodic pulse train (triangle-shaped). This "fundamental frequency" usually lies in
the range from 80 Hz to 350 Hz.
• UNVOICED EXCITATION
The glottis is open and the air passes a narrow passage in the throat or mouth. This results in
a turbulence which generates a noise signal. The spectral shape of the noise is determined by the
location of the narrowness.
• TRANSIENT EXCITATION
A closure in the throat or mouth will raise the air pressure. By suddenly opening the closure
the air pressure drops immediately ("plosive burst"). With some speech sounds these three
kinds of excitation occur in combination. The spectral shape of the speech signal is determined by
the shape of the vocal tract (the pipe formed by your throat, tongue, teeth and lips). By changing
the shape of the pipe (and in addition opening and closing the air flow through your nose) you
change the spectral shape of the speech signal, thus articulating different speech sounds.
An engineer looking at (or listening to) a speech signal might characterize it as follows:
• The bandwidth of the signal is 4 kHz
• The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz
• There are peaks in the spectral distribution of energy at (2n − 1) · 500 Hz, n = 1, 2, 3, . . . (1.1)
• The envelope of the power spectrum of the signal shows a decrease with increasing frequency
(-6 dB per octave).
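The engineer's characterization above can be written as a small sketch, using the values stated in the text (the formant-peak relation of equation (1.1) and the -6 dB per octave roll-off); this is purely illustrative and is not part of the report's software.

```python
import math

def formant_peak_hz(n: int) -> float:
    """Frequency of the n-th spectral energy peak, per equation (1.1):
    (2n - 1) * 500 Hz."""
    return (2 * n - 1) * 500.0

def envelope_drop_db(f_hz: float, f_ref_hz: float = 500.0) -> float:
    """Spectral-envelope attenuation relative to f_ref_hz, assuming the
    stated roll-off of -6 dB per octave."""
    return -6.0 * math.log2(f_hz / f_ref_hz)

print([formant_peak_hz(n) for n in (1, 2, 3)])  # [500.0, 1500.0, 2500.0]
print(envelope_drop_db(1000.0))                 # -6.0 (one octave above 500 Hz)
```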
1.1.2 CAUSES OF DEAFNESS IN HUMANS
Many speech and sound disorders occur without a known cause. Some speech-sound errors
can result from physical problems such as
• Developmental disorders
• Genetic syndromes
• Hearing loss
• Illness
• Neurological disorders
Some of the major types are listed below.
• Genetic Hearing Loss
• Conductive Hearing Loss
• Perceptive Hearing Loss
• Pre-Lingual Deafness
• Post-Lingual Deafness
• Unilateral Hearing Loss
1. In some cases, hearing loss or deafness is due to hereditary factors. Genetics is considered to
play a major role in the occurrence of sensory neural hearing loss. Congenital deafness can
happen due to heredity or birth defects.
2. Causes of Human deafness include continuous exposure to loud noises. This is commonly
observed in people working in construction sites, airports and nightclubs. This is also
experienced by people working with firearms and heavy equipment, and those who use music
headphones frequently. The longer the exposure, the greater is the chance of getting affected by
hearing loss and deafness.
3. Some diseases and disorders can also be a contributory factor for deafness in humans. This
includes measles, meningitis, some autoimmune diseases like Wegener's granulomatosis,
mumps, presbycusis, AIDS and chlamydia. Fetal alcohol syndrome, which develops in babies
born to alcoholic mothers, can cause hearing loss in infants. Growing adenoids can also cause
hearing loss by obstructing the Eustachian tube. Otosclerosis, which is a disorder of the middle
ear bone, is another cause of hearing loss and deafness. Likewise, there are many other medical
conditions which can cause deafness in humans.
4. Some medications are also considered to be the cause of permanent hearing loss in humans,
while others can lead to deafness which can be reversed. The former category includes
medicines like gentamicin and the latter includes NSAIDs, diuretics, aspirin and macrolide
antibiotics. Narcotic pain killer addiction and heavy hydrocodone abuse can also cause
deafness.
5. Causes of human deafness also include exposure to certain industrial chemicals. These ototoxic
chemicals can contribute to hearing loss if combined with continuous exposure to loud noise.
These chemicals can damage the cochlea and some parts of the auditory system.
6. Sometimes, loud explosions can cause deafness in humans. Head injury is another cause for
deafness in humans.
The above are some of the common causes of deafness in humans. There can be many other
reasons which can lead to deafness or hearing loss in humans. It is always advisable to protect the
ears from trauma and other injuries, and to wear protective gear in workplaces, where there are
continuous heavy noises.
1.2 NEED FOR THE SYSTEM
Deaf communities revolve around sign languages as they are their natural means of
communication. Although deaf, hard of hearing and hearing signers can communicate without
problems amongst themselves, there is a serious challenge for the deaf community in trying to
integrate into educational, social and work environments. An important problem is that there are
not enough sign-language interpreters. In India, around 60 million people have hearing
deficiencies, many of them Deaf sign-language users, but there are only about 7,000 sign-language
interpreters, i.e. thousands of deaf people for every interpreter. This information shows the need to
develop automatic translation systems with new technologies for helping hearing and Deaf people
to communicate between themselves.
1.3 AVAILABLE MODELS
Previous approaches have focused on recognizing mainly the hand alphabet which is used to
finger spell words and complete signs which are formed by dynamic hand movements. So far body
language and facial expressions have been left out. Hand gesture recognition can be achieved in
two ways: video-based and instrumented.
The video based systems allow the signer to move freely without any instrumentation
attached to the body. The hand shape, location and movement are recognized by cameras. But the
signer is constrained to sign in a controlled environment. The amount of data to be processed in the
image imposes a restriction on memory, speed and complexity on the computer equipment.
Instrumented approaches require sensors to be placed on the signer's hands. They are
restrictive and cumbersome, but more successful at recognizing hand gestures than video-based
approaches.
1.4 PROBLEM DEFINITION
Sign language is very complex with many actions taking place both sequentially and
simultaneously. Existing translators are bulky, slow and not precise due to the heavy parallel
processing required. The cost of these translators is usually very high due to the hardware required
to meet the processing demands. There is an urgent requirement for a simple, precise and
inexpensive system that helps to bridge the gap between hearing people who do not know sign
language and deaf persons who communicate through sign language, who are unfortunately present
in significantly large numbers in a country such as India.
In this project, the aim is to detect single-hand gestures in two dimensional space, using a
vision based system and speech input through microphone. The selected features should be as
small as possible in number, invariant to input errors like vibrating hand, small rotation, scale,
pitch and voice which may vary from person to person or with different input devices and provide
audio output through speaker and visual output on display device. The acceptable delay of the
system is the end of each gesture, meaning that the pre-processing should be in real-time. One of
our goals in the design and development of this system is scalability in detecting a reasonable
number of gestures and words and the ability to add new gestures and words in the future.
1.5 SCOPE OF THE PROJECT
The developed software is a standalone application. It can be installed and run on any standard PC
or iOS phone. It can be used in a large variety of environments such as shops and governmental
offices, and also for communication between a deaf user and information systems such as vending
machines or PCs. The scope proposed for this project is outlined below.
For sign to speech translation:
i. To develop an image acquisition system that automatically acquires images when triggered, for a
fixed interval of time, or when gestures are present.
ii. To develop a set of gesture definitions and the associated filtration, effect and function
processes.
iii. To develop a pre-defined gesture algorithm that commands the computer to play back the
corresponding audio model.
iv. To develop a testing system that proceeds to the command if the condition holds for the
processed images.
v. To develop a simple Graphical User Interface for input and indication purposes.
For speech to sign translation:
i. To develop a speech acquisition system that automatically acquires speech input when triggered,
for a fixed interval of time, or when speech is present.
ii. To develop a set of phoneme definitions and the associated filtration, effect and function
processes.
iii. To develop a pre-defined phonetics algorithm that commands the computer to display the
corresponding sign model.
iv. To develop a testing system that proceeds to the command if the condition holds for the
processed phonemes.
v. To develop a simple Graphical User Interface for input and indication purposes.
1.6 FUTURE PROSPECTS
For sign language to spoken English translation, the software is able to translate only static
signs to spoken English currently. It can be extended to translate dynamic signs. Also facial
expressions and body language can be tracked and considered which improves the performance of
the sign language to spoken English translation. For spoken English to sign language translation,
the system can be made user voice specific to eliminate the system response to non user.
1.7 ORGANISATION OF REPORT
This report is composed of five chapters, each giving details on an aspect of the
project. Chapter 1 introduces the report and explains the foundation on which the system is built.
Chapter 2 states the aim and objectives of the project. Chapter 3 explains the materials and
methodology followed to achieve that aim and those objectives and to arrive at a complete
application; it starts with an overview of the key software and hardware components and how the
two cooperate, followed by a closer look at the overall system built. Chapter 4 presents the results
and a discussion of the system and its performance at the various stages of implementation. The
report is concluded in Chapter 5, which briefly discusses what the proposed system has
accomplished and provides an outlook on future work recommended for extending this project.
2. AIM AND OBJECTIVES OF THE PROJECT
2.1. AIM
To develop a mobile interactive application program for automatic translation of Indian sign
language into spoken English and vice-versa to assist the communication between Deaf people and
hearing people. The sign language translator should be able to translate one-handed Indian Sign
Language finger-spelling input of alphabets (A-Z) and numbers (1-9) to spoken English audio
output, and 165 spoken English words as input to Indian Sign Language picture display output.
2.2. OBJECTIVES
• To acquire one hand finger spelling of alphabets (A to Z) and numbers (1 to 9) to produce
spoken English audio output.
• To acquire spoken English word input to produce Indian Sign language picture display output.
• To create an executable file to make the software a standalone application.
• To implement the software and optimize the parameters to improve the accuracy of translation.
• To minimize hardware requirements and thus expense while achieving high precision of
translation.
3. MATERIALS AND METHODOLOGY
This chapter is dedicated to explaining the system in detail, from the setup, through the
system components, to the output. The software is developed on the LabVIEW virtual
instrumentation platform. It consists of two main parts, namely sign language to speech translation
and speech to sign language translation, and can be implemented using a standard laptop,
desktop or an iOS mobile phone to operate with the camera, processor and audio device.
Figure 3.1 Graphical abstract
The software consists of four modules that can be implemented from a single window. The
necessary steps to implement these modules from a single window are explained in detail below.
3.1 SIGN LANGUAGE TO SPOKEN ENGLISH TRANSLATION
The sign language to spoken English translation is achieved using pattern matching
technique. The complete interactive section can be considered to comprise two layers:
detection and recognition. The detection layer is responsible for defining and extracting visual
features that can be attributed to the presence of hands in the field of view of the camera. The
recognition layer is responsible for grouping the spatiotemporal data extracted in the previous
layers and assigning the resulting groups with labels associated to particular classes of gestures.
3.1.1 DETECTION
The primary step in gesture recognition systems is the detection of hands and the
corresponding image regions. This step is crucial because it isolates the task-relevant data from the
image background, before passing them to the subsequent tracking and recognition stages. A large
number of methods have been proposed in the literature that utilize several types of visual
features and, in many cases, their combination. Such features include skin color, shape, motion, and
anatomical models of the hand. Several color spaces have been proposed, including RGB,
normalized RGB, HSV, YCrCb, YUV, etc. Color spaces efficiently separating the chromaticity
from the luminance components of color are typically considered preferable. This is due to the fact
that by employing chromaticity-dependent components of color only, some degree of robustness to
illumination changes can be achieved.
Template-based detection is used in this work: the hand detector is invoked in the
spatial vicinity where the hand was detected in the previous frame, so as to drastically restrict the
image search space. The implicit assumption for this method to succeed is that images are acquired
frequently enough. The proposed technique is explained in the following intermediate steps
namely: image acquisition, image processing, template preparation and pattern recognition.
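As a hedged sketch of the chromaticity idea discussed above (the report itself implements detection in LabVIEW), the conversion below follows the standard ITU-R BT.601 YCrCb definition, and the skin-color bounds are illustrative assumptions, not values taken from this project:

```python
import numpy as np

def rgb_to_ycrcb(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 RGB image (0-255 values) to YCrCb, separating
    luminance (Y) from the chromaticity components (Cr, Cb)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    return np.stack([y, cr, cb], axis=-1)

def skin_mask(rgb: np.ndarray) -> np.ndarray:
    """Threshold only the chromaticity planes (Cr, Cb), ignoring Y, so the
    mask gains some robustness to illumination changes. Bounds are
    illustrative assumptions."""
    ycrcb = rgb_to_ycrcb(rgb)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    return (cr > 135) & (cr < 180) & (cb > 85) & (cb < 135)
```

Because only Cr and Cb are thresholded, a brighter or darker version of the same skin tone produces a similar mask, which is the robustness property the text attributes to chromaticity-based color spaces.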
3.1.2 IMAGE ACQUISITION
The software is installed in any supporting operating system with access to camera and
microphone. After installing the executable file, follow the instructions that appear on the
Graphical User Interface (GUI) and execute the program. The program allows the user to choose
the camera. All cameras accessible through the operating system, whether inbuilt or externally
connected, appear in the selection list. After choosing the camera, the
software sends commands to the camera to capture the gestures of sign language performed by the
user. Image acquisition process is subjected to many environmental concerns such as the position
of the camera, lighting sensitivity and background condition. The camera is placed to focus on an
area that can capture the maximum possible movement of the hand and take into account the
difference in height of individual signers. Sufficient lighting is required to ensure that the acquired
image is bright enough to be seen and analyzed. Capturing thirty frames per second (fps) is found
to be sufficient. A higher fps would only lead to higher computation time, as there is more input
data to be processed. As the acquisition process runs in real time, this part of the process has to be
efficient. The acquired images are then processed. The previous frame that has been processed will
be automatically deleted to free the limited memory space in the buffer.
3.1.3 IMAGE PROCESSING
The captured images are processed to identify the unique features of each sign. Image
processing enhances the features of interest for recognition of the sign. The camera captures
images at 30 frames per second. At this rate, the difference between subsequent images will be too
small. Hence, the images are sampled at 5 frames per second. In the program, one frame is saved
and numbered sequentially every 200 milliseconds so that the image classifying and processing can
be done systematically. The position of the hand is monitored. The image acquisition runs
continuously until the acquisition is stopped. The image processing involves performing
morphological operations on the input images to enhance the unique features of each sign. As the
frames from acquisition are read one by one, they are subjected to extraction of single color plane
of luminance.
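A minimal sketch of the sampling and luminance-extraction steps described above (a Python/NumPy stand-in for the LabVIEW implementation; the 30 fps capture rate and 5 fps sampling rate come from the text, while the luminance weights are the standard BT.601 coefficients):

```python
import numpy as np

def sampled_frame_indices(total_frames: int,
                          capture_fps: int = 30,
                          sample_fps: int = 5) -> list:
    """Indices of the frames kept when downsampling the 30 fps capture
    stream to 5 fps, i.e. one saved frame every 200 ms."""
    step = capture_fps // sample_fps  # 6 captured frames between samples
    return list(range(0, total_frames, step))

def luminance_plane(rgb: np.ndarray) -> np.ndarray:
    """Extract the single luminance (Y) plane from an RGB frame, as done
    before pattern matching."""
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return np.round(y).astype(np.uint8)

print(sampled_frame_indices(30))  # [0, 6, 12, 18, 24] -> 5 frames per second
```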
3.1.4 TEMPLATE PREPARATION
The images to be used for pattern matching as templates are prepared using the following
procedure and are saved in a folder to be later used in the pattern matching.
1. Open Camera
Open the camera, query the camera for its capabilities, load the camera configuration file, and
create a unique reference to the camera.
2. Configure Acquisition
Configure a low-level acquisition previously opened with IMAQdx Open Camera VI. Specify
the acquisition type with the Continuous and Number of Buffers parameters. Snap: Continuous
= 0; Buffer Count = 1
Sequence: Continuous = 0; Buffer Count > 1
Grab: Continuous = 1; Buffer Count > 1
3. Start Acquisition
Start an acquisition that was previously configured with the IMAQdx Configure Acquisition.
4. Create
Create a temporary memory location for an image.
5. Get Image
Acquire the specified frame into Image Out. If the image type does not match the video format
of the camera, this VI changes the image type to a suitable format.
6. Extract Single Color Plane
Extract a single plane from the color image.
7. Setup Learn Pattern
Sets parameters used during the learning phase of pattern matching.
8. Learn Pattern
Create a description of the template image for which you want to search during the matching
phase of pattern matching. This description data is appended to the input template image.
During the matching phase, the template descriptor is extracted from the template image and
used to search for the template in the inspection image.
Figure 3.2 Flow diagram of template preparation
9. Write File 2
Write the image to a file in the selected format.
10. Close Camera
Stop the acquisition in progress, release resources associated with the acquisition, and close the
specified Camera Session.
11. Merge Errors Function
Merge error I/O clusters from different functions. This function looks for errors beginning with
the error in 0 parameter and reports the first error found. If the function finds no errors, it looks
for warnings and returns the first warning found. If the function finds no warnings, it returns no
error.
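The "Learn Pattern" idea in steps 7 and 8 can be sketched as precomputing a matching descriptor for each template. This is a simplified Python stand-in for illustration only, not the actual NI Vision implementation:

```python
import numpy as np

def learn_pattern(template: np.ndarray) -> dict:
    """Precompute a descriptor for a grayscale template so that the
    matching phase does not redo this work: zero-mean pixel values plus
    their norm, as needed by normalized cross-correlation."""
    t = template.astype(np.float64)
    t = t - t.mean()
    return {"pattern": t, "norm": float(np.linalg.norm(t))}
```

Appending this descriptor to the saved template image is what allows the matching phase to extract it later and search the inspection image without re-learning the template.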
3.1.5 IMAGE RECOGNITION
The last stage of sign language to spoken English translation is the recognition stage and
providing the audio output. The techniques used for feature extraction should find shapes reliably
and robustly irrespective of changes in illumination levels, position, orientation and size of the
object in a video. Objects in an image are represented as collections of pixels, and for object
recognition we need to describe the properties of these groups of pixels. The description of an
object is a set of numbers called the object's descriptors. Recognition is then simply a matter of
matching a set of shape descriptors against a set of known descriptors. A usable descriptor should
possess four valuable properties: the descriptors should form a complete set, be congruent, be
rotation invariant, and form a compact set. Objects in an image are characterized by two forms of
descriptors: region descriptors, which describe the arrangement of pixels within the object area,
and shape descriptors, which describe the arrangement of pixels along the object boundary.
Template matching, a fundamental pattern recognition technique, has been utilized for
gesture recognition. Template matching is performed by pixel-by-pixel comparison of a prototype
and a candidate image; the similarity of the candidate to the prototype is proportional to its total
score on a preselected similarity measure. For the recognition of hand postures, the image of a
detected hand forms the candidate image, which is compared directly with prototype images of
hand postures. The best matching prototype (if any) is taken as the recognized posture.
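The pixel-by-pixel comparison described above can be sketched in a few lines. The following Python fragment is only an illustration (the project itself uses LabVIEW's IMAQ pattern matching, not this code); the normalized cross-correlation measure and the 0.7 acceptance threshold are assumptions chosen for the example.

```python
import numpy as np

def match_score(template: np.ndarray, candidate: np.ndarray) -> float:
    """Normalized cross-correlation between two equal-sized grayscale images.

    Returns a similarity in [-1, 1]; 1.0 means a pixel-perfect match.
    """
    t = template.astype(float) - template.mean()
    c = candidate.astype(float) - candidate.mean()
    denom = np.sqrt((t ** 2).sum() * (c ** 2).sum())
    return float((t * c).sum() / denom) if denom else 0.0

def classify(candidate, prototypes):
    """Return the name of the best-matching prototype, or None below threshold."""
    scores = {name: match_score(p, candidate) for name, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.7 else None
```

The threshold rejects candidates that do not resemble any prototype closely enough, which is what lets the system answer "no match" instead of forcing a classification.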
The final stage of the system is the classification of different signs and the generation of voice
messages corresponding to each correctly classified sign. The acquired, preprocessed images are
read one by one and compared with the template images saved in the database for pattern
matching. The pattern matching parameter is a threshold on the maximum difference between the
input sign and the database entry: if the difference is below this limit, a match is found and the
sign is recognized. Ideally the threshold is set at 800. When an input image matches a template
image, the pattern matching loop stops, and the audio corresponding to that loop iteration value is
played through the built-in audio device. The necessary steps to achieve sign language to speech
translation are given below.
1. Invoke Node
Invoke a method or action on a reference.
2. Default Values: Reinitialize All To Default Method
Change the current values of all controls on the front panel to their defaults.
3. File Dialog Express
Displays a dialog box with which you can specify the path to a file or directory from existing
files or directories or to select a location and name for a new file or directory.
4. Open Camera
Opens a camera, queries the camera for its capabilities, loads a camera configuration file, and
creates a unique reference to the camera.
5. Configure Grab
Configures and starts a grab acquisition. A grab performs an acquisition that loops continually
on a ring of buffers.
6. Create
Create a temporary memory location for an image.
7. Grab
Acquire the most current frame into Image Out.
8. Extract Single Color Plane
Extract a single plane from a color image.
9. Setup Learn Pattern
Sets parameters used during the learning phase of pattern matching.
Figure 3.3 Flow diagram of pattern matching
10. Learn Pattern
Create a description of the template image for which you want to search during the matching
phase of pattern matching. This description data is appended to the input template image.
During the matching phase, the template descriptor is extracted from the template image and
used to search for the template in the inspection image.
11. Recursive File List
List the contents of a folder or LLB.
12. Unbundle By Name Function
Returns the cluster elements whose names you specify.
13. Read File
Read an image file. The file format can be a standard format (BMP, TIFF, JPEG, JPEG2000,
PNG, and AIPD) or a nonstandard format known to the user. In all cases, the read pixels are
converted automatically into the image type passed by Image.
14. Call Chain
Return the chain of callers from the current VI to the top-level VI. Element 0 of the call chain
array contains the name of the lowest VI in the call chain. Subsequent elements are callers of
the lower VIs in the call chain. The last element of the call chain array is the name of the top-
level VI.
15. Index Array
Return the element or sub-array of n-dimension array at index.
16. Format Into String
Formats string, path, enumerated type, time stamp, Boolean, or numeric data as text.
17. Read Image And Vision Info
Read an image file, including any extra vision information saved with the image. This includes
overlay information, pattern matching template information, calibration information, and
custom data, as written by the IMAQ Write Image and Vision Info File 2 instance of the IMAQ
Write File 2 VI.
18. Pattern Match Algorithm
Check for the presence of the template image in the given input image.
19. Speak Text
Call the .NET speech synthesizer to speak a string of text.
20. Dispose
Destroys an image and frees the space it occupied in memory. This VI is required for each
image created in an application to free the memory allocated to the IMAQ Create VI.
21. Simple Error Handler
Indicate whether an error occurred. If an error occurred, this VI returns a description of the
error and optionally displays a dialog box.
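The overall control flow that steps 1-21 implement can be summarized as a loop. The Python below is only a sketch of that flow, with hypothetical `grab_frame`, `match`, and `speak` helpers standing in for the LabVIEW VIs; it is not the project's actual implementation.

```python
# Sketch of the sign-to-speech loop, with hypothetical helpers standing in
# for the LabVIEW VIs: grab_frame() acquires a frame, match() returns a
# difference score, and speak() drives the .NET speech synthesizer.

def translate_loop(grab_frame, templates, match, speak, threshold=800):
    """Grab frames and compare each against every template; speak the first match.

    A difference score below `threshold` counts as a match, mirroring the
    report's maximum-difference setting of 800.
    """
    while True:
        frame = grab_frame()                       # acquire the current frame
        for label, template in templates.items():  # one template per iteration
            if match(template, frame) < threshold:
                speak(label)                       # audio output for the match
                return label
        # no template matched this frame; continue with the next one
```

In the LabVIEW version the loop iteration count selects a case in a case structure; here the dictionary key plays the same role of naming the matched sign.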
Figure 3.4 Block diagram of sign to speech translation
3.2 SPEECH TO SIGN LANGUAGE TRANSLATION
The speech input is acquired through the built-in microphone using Windows Speech
Recognition software. The system recognizes the speech input phrases that are listed in the
database; each phrase in the database is associated with a picture of a sign language gesture. If the
input speech matches the database, a command is sent to display the corresponding gesture. The
necessary steps to achieve speech to sign language translation are given below.
1. Current VI’s path
Return the path to the file of the current VI.
2. Strip Path
Return the name of the last component of a path and the stripped path that leads to that
component.
3. Build path
Create a new path by appending a name (or relative path) to an existing path.
4. VI Server Reference
Return a reference to the current VI or application, to a control or indicator in the VI, or to a
pane. You can use this reference to access the properties and methods for the associated VI,
application, control, indicator, or pane.
5. Property Node
Get (reads) and/or set (writes) properties of a reference. Use the property node to get or set
properties and methods on local or remote application instances, VIs, and objects.
6. Speech Recognizer Initialize
The event is raised when the current grammar has been used by the recognition engine to
detect speech and find one or more phrases with sufficient confidence levels.
7. Event Structure
Has one or more subdiagrams, or event cases, exactly one of which executes when the
structure executes. The Event structure waits until an event happens, then executes the
appropriate case to handle that event.
8. Read JPEG File VI
Read the JPEG file and create the data necessary to display the file in a picture control.
9. Draw Flattened Pixmap VI
Draw a 1-, 4-, or 8-bit pixmap or a 24-bit RGB pixmap into a picture.
10. 2D Picture Control
Include a set of drawing instructions for displaying pictures that can contain lines, circles, text,
and other types of graphic shapes.
11. Simple Error Handler VI
Indicate whether an error occurred. If an error occurred, this VI returns a description of the
error and optionally displays a dialog box.
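The lookup performed by steps 1-11 can be sketched as follows. This is an illustrative Python fragment, not the project's LabVIEW implementation; the file-naming convention (lower-case phrase plus `.jpg` inside the "Sign Images" folder) is an assumption for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical sketch of the speech-to-sign display logic: a phrase
# recognized by Windows Speech Recognition is mapped to a gesture picture.

def sign_for_phrase(phrase: str, image_dir: str = "Sign Images") -> Optional[Path]:
    """Return the gesture picture path for a recognized phrase, or None
    when the phrase is not in the database."""
    candidate = Path(image_dir) / (phrase.lower() + ".jpg")
    return candidate if candidate.exists() else None
```

Extending the vocabulary then amounts to dropping a new image into the folder and adding the phrase to the recognizer's list, which matches how the report describes extending the application.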
Figure 3.5 Flow diagram speech to sign translation
Figure 3.6 Block diagram speech to sign translation
3.2.1 SPEECH RECOGNITION
A speech recognition system consists of the following:
• A microphone, for the person to speak into.
• Speech recognition software.
• A computer to take and interpret the speech.
• A good quality soundcard for input and/or output.
Voice-recognition software works by analyzing sounds and converting them to text. It also
uses knowledge of how English is usually spoken to decide what the speaker most probably said.
Once correctly set up, such systems should recognize around 95% of what is said if you speak
clearly. Several programs are available that provide voice recognition. These systems have mostly
been designed for Windows operating systems; however, programs are also available for Mac OS
X. In addition to third-party software, there are also voice-recognition programs built in to the
operating systems of Windows Vista and Windows 7. Most specialist voice applications include
the software, a microphone headset, a manual and a quick reference card. You connect the
microphone to the computer, either into the soundcard (sockets on the back of a computer) or via a
USB or similar connection. The latest versions of Microsoft Windows have a built-in voice-
recognition program called Speech Recognition. It does not have as many features as Dragon
NaturallySpeaking but does have good recognition rates and is easy to use. As it is part of the
Windows operating system, it does not require any additional cost apart from a microphone.
The input voice recognition is achieved using the Windows 7 built-in Speech Recognition
software. When the program is started, instructions appear for setting up the microphone, and a
tutorial begins that walks the user through voice recognition.
A computer doesn't speak your language, so it must transform your words into something it
can understand. A microphone converts your voice into an analog signal and feeds it to your PC's
sound card. An analog-to-digital converter takes the signal and converts it to a stream of digital
data (ones and zeros). Then the software goes to work. While each of the leading speech
recognition companies has its own proprietary methods, the two primary components of speech
recognition are common across products. The first piece, called the acoustic model, analyzes the
sounds of your voice and converts them to phonemes, the basic elements of speech. The English
language contains approximately 50 phonemes.
Here's how it breaks down your voice: First, the acoustic model removes noise and unneeded
information such as changes in volume. Then, using mathematical calculations, it reduces the data
to a spectrum of frequencies (the pitches of the sounds), analyzes the data, and converts the words
into digital representations of phonemes. The software operation is as explained below.
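The noise-removal and frequency-reduction steps above can be sketched for a single short frame of audio. This is an illustrative fragment only, not the engine's actual code; frame length and windowing are assumptions for the example.

```python
import numpy as np

# Illustrative sketch of the acoustic front end: normalize away volume
# changes, window the frame, then reduce it to a spectrum of frequencies.

def frame_spectrum(frame: np.ndarray) -> np.ndarray:
    """Return the magnitude spectrum of one short speech frame."""
    peak = np.abs(frame).max()
    if peak > 0:
        frame = frame / peak                    # remove changes in volume
    windowed = frame * np.hanning(len(frame))   # taper edges before the FFT
    return np.abs(np.fft.rfft(windowed))        # pitches present in the sound
```

Each such spectrum describes the sound heard during that slice of time, which is the representation the acoustic model goes on to convert into phonemes.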
Figure 3.7 Speech recognizer tutorial window
3.2.2 STEPS TO VOICE RECOGNITION
• ENROLMENT
Everybody’s voice sounds slightly different, so the first step in using a voice-recognition
system involves reading an article displayed on the screen. This process, called enrolment, takes
less than 10 minutes and results in a set of files being created which tell the software how you
speak. Many of the newer voice-recognition programs say this is not required, however it is still
worth doing to get the best results. The enrolment only has to be done once, after which the
software can be started as needed.
• DICTATING AND CORRECTING
When talking, people often hesitate, mumble or slur their words. One of the key skills in
using voice-recognition software is learning how to talk clearly so that the computer can recognize
what you are saying. This means planning what to say and then speaking in complete phrases or
sentences. The voice-recognition software will misunderstand some of the words spoken, so it is
necessary to proofread and then correct any mistakes. Corrections can be made by using the mouse
and keyboard or by using your voice. When you make corrections, the voice-recognition software
will adapt and learn, so that (hopefully) the same mistake will not occur again. Accuracy should
improve with careful dictation and correction.
• INPUT
The first step in voice recognition (VR) is the input and digitization of the voice into VR-
capable software. This generally happens via an active microphone plugged into the computer. The
user speaks into the microphone, and an analog-to-digital converter (ADC) creates digital sound
files for the VR program to work with.
• ANALYSIS
The key to VR is in the speech analysis. VR programs take the digital recording and parse it
into small, recognizable speech bits called "phonemes," via high-level audio analysis software.
(There are approximately 40 of these in the English language.)
• SPEECH-TO-TEXT
Once the program has identified the phonemes, it begins a complex process of identification
and contextual analysis, comparing each string of recorded phonemes against text equivalents in its
memory. It then accesses its internal language database and pairs up the recorded phonemes with
the most probable text equivalents.
• OUTPUT
Finally, the VR software provides a word output to the screen, mere moments after speaking. It
continues this process, at high speed, for each word spoken into its program. Speech recognition
fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio
from a sound card into recognized speech. The elements of the pipeline are:
1. Transform the PCM digital audio into a better acoustic representation.
2. Apply a "grammar" so the speech recognizer knows what phonemes to expect. A grammar
could be anything from a context-free grammar to full-blown Language.
3. Figure out which phonemes are spoken.
4. Convert the phonemes into words.
• TRANSFORM THE PCM DIGITAL AUDIO
The first element of the pipeline converts digital audio coming from the sound card into a
format that's more representative of what a person hears. The wave format can vary. In other
words, it may be 16 kHz 8-bit mono/stereo or 8 kHz 16-bit mono, and so forth. It's a wavy line
that periodically repeats while the user is speaking. When in this form, the data isn't useful to
speech recognition because it's too difficult to identify any patterns that correlate to what was
actually said. To make pattern recognition easier, the PCM digital audio is transformed into the
"frequency domain." Transformations are done using a windowed Fast-Fourier Transform (FFT).
The output is similar to what a spectrograph produces. In frequency domain, you can identify the
frequency components of a sound. From the frequency components, it's possible to approximate
how the human ear perceives the sound.
The FFT analyzes every 1/100th of a second and converts the audio data into the frequency
domain. Each 1/100th of second's results are a graph of the amplitudes of frequency components,
describing the sound heard for that 1/100th of a second. The speech recognizer has a database of
several thousand such graphs (called a codebook) that identify different types of sounds the human
voice can make. The sound is "identified" by matching it to its closest entry in the codebook,
producing a number that describes the sound. This number is called the "feature number."
(Actually, there are several feature numbers generated for every 1/100th of a second, but the
process is easier to explain assuming only one.) The input to the speech recognizer began as a
stream of 16,000 PCM values per second. By using Fast-Fourier Transforms and the codebook, it is
boiled down into essential information, producing 100 feature numbers per second.
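The codebook step described above can be sketched as a nearest-neighbour lookup. The fragment below is purely illustrative; a real recognizer's codebook holds several thousand learned entries, not the tiny hand-made one assumed here.

```python
import numpy as np

# Sketch of the codebook step: every 1/100 s spectrum is replaced by the
# index of its nearest codebook entry -- the "feature number".

def feature_number(spectrum: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the codebook entry closest to this spectrum."""
    distances = np.linalg.norm(codebook - spectrum, axis=1)
    return int(distances.argmin())
```

Running this once per 1/100th of a second turns the 16,000-samples-per-second stream into the 100 feature numbers per second that the rest of the recognizer works with.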
• FIGURE OUT WHICH PHONEMES ARE SPOKEN
To figure out which phonemes are spoken the following procedure is used.
• Start by grouping. To make the recognition process easier to understand, you first should know
how the recognizer determines what phonemes were spoken and then understand the grammars.
• Every time a user speaks a word, it sounds different. Users do not produce exactly the same
sound for the same phoneme.
• The background noise from the microphone and user's office sometimes causes the recognizer
to hear a different vector than it would have if the user were in a quiet room with a high-quality
microphone.
• The sound of a phoneme changes depending on what phonemes surround it. The "t" in "talk"
sounds different than the "t" in "attack" and "mist."
The sound produced by a phoneme changes from the beginning to the end of the phoneme,
and is not constant. The beginning of a "t" will produce different feature numbers than the end of
a "t." The background noise and variability problems are solved by allowing a feature number to be
used by more than just one phoneme, and using statistical models to figure out which phoneme is
spoken. This can be done because a phoneme lasts for a relatively long time, 50 to 100 feature
numbers, and it's likely that one or more sounds are predominant during that time. Hence, it's
possible to predict what phoneme was spoken.
The speech recognizer needs to know when one phoneme ends and the next begins. Speech
recognition engines use a mathematical technique called "Hidden Markov Models" (HMMs) that
figure this out. The speech recognizer figures out when speech starts and stops because it has a
"silence" phoneme, and each feature number has a probability of appearing in silence, just like any
other phoneme. Now, the recognizer can recognize what phoneme was spoken if there's
background noise or the user's voice had some variation. However, there's another problem. The
sound of phonemes changes depending upon what phoneme came before and after. You can hear
this with words such as "he" and "how". You don't speak a "h" followed by an "ee" or "ow," but the
vowels intrude into the "h," so the "h" in "he" has a bit of "ee" in it, and the "h" in "how" has a bit
of "ow" in it.
Speech recognition engines solve the problem by creating "tri-phones," which are phonemes
in the context of surrounding phonemes. Thus, there's a tri-phone for "silence-h-ee" and one for
"silence-h-ow." Because there are roughly 50 phonemes in English, you can calculate that there are
50*50*50 = 125,000 tri-phones. That's just too many for current PCs to deal with, so similar-
sounding tri-phones are grouped together.
The sound of a phoneme is not constant. A "t" sound is silent at first, then produces a
sudden burst of high-frequency noise, which then fades to silence. Speech recognizers solve this by
splitting each phoneme into several segments and generating a different model for each segment.
The recognizer figures out where each segment begins and ends in the same way it figures out
where a phoneme begins and ends.
A speech recognizer works by hypothesizing a number of different "states" at once. Each
state contains a phoneme with a history of previous phonemes. The hypothesized state with the
highest score is used as the final recognition result.
When the speech recognizer starts listening, it has one hypothesized state. It assumes the user
isn't speaking and that the recognizer is hearing the "silence" phoneme. Every 1/100th of a second,
it hypothesizes that the user has started speaking, and adds a new state per phoneme, creating 50
new states, each with a score associated with it. After the first 1/100th of a second, the recognizer
has 51 hypothesized states.
In 1/100th of a second, another feature number comes in. The scores of the existing states are
recalculated with the new feature. Then, each phoneme has a chance of transitioning to yet another
phoneme, so 51 * 50 = 2550 new states are created. The score of each state is the score of the first
1/100th of a second times the score of the 2nd 1/100th of a second. After 2/100ths of a second, the
recognizer has 2601 hypothesized states.
This same process is repeated every 1/100th of a second. The score of each new hypothesis is
the score of its parent hypothesis times the score derived from the new 1/100th of a second. In the
end, the hypothesis with the best score is what's used as the recognition result.
• ADAPTATION
Speech recognition systems "adapt" to the user's voice, vocabulary, and speaking style to
improve accuracy. A system that has had enough time to adapt to an individual can have one fourth
the error rate of a speaker-independent system. Adaptation works because the speech recognizer is
often informed (directly or indirectly) by the user whether its recognition was correct and, if not,
what the correct recognition is.
The recognizer can adapt to the speaker's voice and variations of phoneme pronunciations in
a number of ways. First, it can gradually adapt the codebook vectors used to calculate the acoustic
feature number. Second, it can adapt the probability that a feature number will appear in a
phoneme. Both of these are done by weighted averaging.
The language model also can be adapted in a number of ways. The recognizer can learn new
words, and slowly increase probabilities of word sequences so that commonly used word sequences
are expected. Both these techniques are useful for learning names.
Although not common, the speech recognizer can adapt word pronunciations in its lexicon.
Each word in a lexicon typically has one pronunciation. The word "orange" might be pronounced
like "or-anj." However, users will sometimes speak "ornj" or "or-enj." The recognizer can
algorithmically generate hypothetical alternative pronunciations for a word. It then listens for all of
these pronunciations during standard recognition, "or-anj," "or-enj," "or-inj," and "ornj." During
the process of recognition, one of these pronunciations will be heard, although there's a fair chance
that the recognizer heard a different pronunciation than what the user spoke. However, after the
user has spoken the word a number of times, the recognizer will have enough examples that it can
determine what pronunciation the user spoke.
However, speech recognition (by a machine) is a very complex problem. Vocalizations vary
in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed.
Speech is also distorted by background noise, echoes, and the electrical characteristics of the
recording equipment. Accuracy of speech recognition varies with the following:
speech recognition varies with the following:
• Vocabulary size and confusability
• Speaker dependence vs. independence
• Isolated, discontinuous, or continuous speech
• Task and language constraints
• Read vs. spontaneous speech
• Adverse conditions
4 RESULTS AND DISCUSSIONS
4.1 RESULTS
In this section we will analyze the performance of the system by its capability to recognize
gestures from images. We also discuss the difficulties faced while designing the system.
4.1.1 APPLICATION
The software is a standalone application. To install it, follow the instructions that appear in
the executable installer file.
Figure 4.1 Application Installer
After installing the application, a graphical user interface (GUI) window opens from which
the full application can be run. The GUI has been created to run the entire application from a
single window. It has four pages, namely page 1, page 2, page 3 and page 4; each page
corresponds to a specific application.
• Page 1 gives a detailed demo of the total software usage.
• Page 2 is for speech to sign language translation.
• Page 3 is for template preparation for sign to speech translation.
• Finally page 4 is for Sign to speech translation.
Figure 4.2 Application window
The functions of the various buttons that appear on the window are as explained below.
To run the application
To stop the application
To go to previous page
To go to next page
4.1.1.1 PAGE 1
This page consists of detailed instructions to execute the entire application. To continue to
the specific application use the previous and next buttons.
4.1.1.2 PAGE 2
This page consists of Speech to Sign language translator. The window appearance is as
shown in figure 4.3. The working of this module is as explained below.
Figure 4.3 GUI of Speech to Sign translation
Building a speech recognition program is not in the scope of this project; instead, an existing
speech recognition engine is integrated into the program. When the “Start” button is pressed, a
command is sent to the Windows 7 built-in Speech Recognizer and it opens a mini window at the
top. The first time it is started, a tutorial session begins which gives instructions to set up the
microphone and recognize the user’s voice input. For the application to take full advantage of
speech recognition, the speech recognition program must be correctly configured: the microphone
and language settings must be set appropriately to make optimal use of the speech recognition
program’s capabilities.
Voice recognition training teaches the software to recognize your voice and speech patterns.
Training involves reading the given paragraphs or single words into the software using a
microphone. The more you repeat the process, the more accurately the program should transcribe
your speech.
According to a Landmark College article, most people get frustrated with the training process
and feel it's too time consuming. Before you decide to skip training, you should think about the
consequences. The software will incorrectly transcribe your speech more often than not, which will
make the software less efficient.
Speaking clearly and succinctly during training makes it easier for the software to recognize
your voice. As a result, you'll spend less time training, repeating yourself and correcting the
program. It also helps to use a good-quality microphone that easily registers your voice. The
speech recognition package also tunes itself to the individual user: the software customizes itself
based on your voice, your unique speech patterns, and your accent. To improve dictation accuracy,
it creates a supplementary dictionary of the words you use.
After the initial training, the program starts speech recognition automatically each time it is
executed. To train the system for a different user or change the microphone settings, right-click on
the Speech Recognizer window and select “Start Speech Tutorial”.
To stop the speech recognition software select the icon or say “Stop listening”. The
Speech recognizer will go to sleep mode.
Figure 4.4 Speech recognizer in sleep mode.
To start speech recognition again select the icon or say “Start Listening”. The Speech
recognizer will go to active mode.
Figure 4.5 Speech recognizer in active mode.
If the user’s speech input is not clear, the recognizer asks the user to repeat the input.
Figure 4.6 Speech recognizer when input speech is not clear for recognition.
The active working mode appearance of the Speech to Sign language translator module
window is as shown in figure 4.7. When the user utters any of the words listed in the “Phrases”
near the microphone, the input sound is processed for recognition. If the input sound matches the
words in the database, it is displayed in the “Command” alphanumeric indicator. A sign language
gesture picture corresponding to the speech input is displayed in the “Sign” picture indicator. Also
the score of speech input correlation with the trained word is displayed in the “Score” numeric
indicator. Use the exit button to exit the application of speech to sign language translation. To
extend the application to translate more input spoken English words to Sign language picture
display output, simply include the sign language images in the folder “Sign Images” and add the
word to the list in the “Phrases”.
Figure 4.7 GUI of working window of speech to sign translation
Figure 4.8 Block diagram of speech to sign translation
4.1.1.3 PAGE 3
Figure 4.9 GUI of template preparation
Figure 4.10 Block diagram of sign to speech translation
This page consists of template preparation setup for Sign language to Speech translator. The
window appearance is as shown in figure 4.11. The working of this module is as explained below.
To execute the template preparation module for Sign language to speech translation, press the
“Start” button. Choose the camera to acquire images to be used as templates, from the “Camera
Name” list. The acquired image is displayed on the “Image” picture indicator. If the displayed
image is suitable for preparing a template, press “Snap frame”. The snapped image is displayed on
“Snap Image” picture display. Draw a region of interest to prepare the template and press “Learn”.
The image region in the selected portion of the snapped frame is saved to the folder specified for
templates. The saved template image is displayed on “Template Image” picture display. Press
“Stop” button to stop execution of template preparation module.
Figure 4.11 GUI of working window of template preparation
4.1.1.4 PAGE 4
This page consists of the Sign to Speech translator. When started, it captures the signs performed
by the deaf user in real time, compares them with the created template images and gives an audio
output when a match is found. The window appearance is as shown in figure 4.12. The working
of this module is as explained below.
Figure 4.12 GUI of sign to speech translation
Press the “Start” button to start the program. The “Camera Name” indicator displays the list of
all the cameras that are connected to the computer. Choose the camera from the list. Adjust the
selected camera position to capture the sign gestures performed by the user. For the performed test
the camera is fixed at a distance of one meter from the user’s hand. The captured images are displayed
on the “Input Image” picture display. Press the “Match” button to start comparing the acquired input
image with the template images in the database. In every iteration, the input image is checked for a
pattern match against one template. When the input image matches the template image, the loop
halts. The “Match” LED glows and the matched template is displayed on the “Template Image”
indicator. If the input image does not match any of the images from the database of templates,
the audio output says “NONE” and the “Match” LED does not glow.
The loop iteration count is used for triggering a case structure. Depending on the iteration
count value, a specific case is selected and gives a string output. Otherwise the loop continues to the
next iteration, where the input image is checked for a pattern match with a new template. The information in
the string output from case structure is displayed on the “Matched Pattern” alphanumeric indicator. It
also initiates the .NET speech synthesizer to give an audio output through the speaker.
Figure 4.13 GUI of working window of sign to speech translation
To pause the pattern matching while the program is still running, press the “Match” button.
This puts the pattern matching step into inactive mode: the acquired image is displayed on the
Input Image indicator but does not go through pattern matching. To resume pattern matching,
press the “Match” button again; it is highlighted, indicating that it is in active mode.
Figure 4.14 Block diagram of sign to speech translation
Figure 4.15 Block diagram of pattern matching
For sign language to spoken English translation, the classification of different gestures is
done using pattern matching technique for 36 different gestures (Alphabets A to Z and numbers 1
to 9) of Indian sign language. The performance of the system is evaluated based on its ability to
correctly recognize signs to their corresponding speech class. The recognition rate is defined as the
ratio of the number of correctly classified signs to the total number of signs:
Recognition Rate (%) = (Number of Correctly Classified Signs / Total Number of Signs) × 100
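As a worked example of the recognition-rate formula, the computation is a single ratio:

```python
def recognition_rate(correct: int, total: int) -> float:
    """Recognition rate in percent: correctly classified signs over total signs."""
    return 100.0 * correct / total
```

With all 36 test signs correctly classified, the rate is 100%, matching the result reported below; 27 correct out of 36 would give 75%.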
The proposed approach has been assessed using input sequences containing a user performing
various gestures in an indoor environment for alphabets A to Z and numbers 1 to 9. This section
presents results obtained from a sequence depicting a person performing a variety of hand gestures
in a setup that is typical for deaf and normal person interaction applications, i.e. the subject is sitting
at a typical distance of about 1 m from the camera. The resolution of the sequence is 640 × 480 and
it was obtained with a standard, low-end web camera at 30 frames per second.
The total number of signs used for testing is 36, and the system recognition rate is 100% for
inputs similar to the database templates. The system was implemented in LabVIEW version 2012.
4.2 DISCUSSIONS
For sign language to speech translation, the gesture recognition problem consists of pattern
representation and recognition. In previous related work, the hidden Markov model (HMM), widely
used in speech recognition, has been applied by a number of researchers to temporal gesture
recognition. Yang and Xu (1994) proposed gesture-based interaction using a multi-dimensional
HMM. They used a Fast Fourier Transform (FFT) to convert input gestures into a sequence of
symbols for training the HMM, and reported 99.78% accuracy in detecting 9 gestures.
Watanabe and Yachida (1998) proposed a method of gesture recognition from image
sequences. The input image is segmented using maskable templates, and the gesture space is then
constituted by Karhunen-Loeve (KL) expansion over the segments. They applied eigenvector-based
matching for gesture detection.
Oka, Sato and Koike (2002) developed gesture recognition based on measured finger
trajectories for an augmented desk interface system. They used a Kalman filter to predict the
locations of multiple fingertips and an HMM for gesture detection, reporting an average
accuracy of 99.2% for single-finger gestures produced by one person. Ogawara et al. (2001)
proposed a method of constructing a human task model by attention point (AP) analysis; their
target application was gesture recognition for human-robot interaction.
New et al. (2003) proposed a gesture recognition system that tracks the hand and detects the
number of fingers being held up to control an external device, based on hand-shape template
matching. Perrin et al. (2004) described a finger-tracking gesture recognition system based on a
laser tracking mechanism that can be used in hand-held devices. They used an HMM for
gesture recognition, with an accuracy of 95% for 5 gesture symbols at a distance of 30 cm from
their device.
Lementec and Bajcsy (2004) proposed an arm gesture recognition algorithm using Euler
angles acquired from multiple orientation sensors, for controlling unmanned aerial vehicles in the
presence of manned aircrew. Dias et al. (2004) described a vision-based open gesture
recognition engine called OGRE, reporting detection and tracking of hand contours using template
matching with an accuracy of 80% to 90%.
Because of the difficulty of collecting data to train an HMM for temporal gesture
recognition, the vocabularies are very limited, and reaching acceptable accuracy is excessively
data- and time-intensive. Some researchers have suggested that a better approach is
needed for more complex systems (Perrin et al., 2004).
This work presents a novel approach to gesture detection with two main
steps: i) gesture template preparation, and ii) gesture detection. The gesture template preparation
technique presented here has several features important for gesture recognition, including
robustness against slight rotation, a small number of required features, invariance to the start
position, and device independence. For gesture detection, a pattern-matching technique is used. The
results of our first experiment show 99.72% average accuracy in single gesture detection. Given the
high accuracy of the gesture classification, the number of templates appears sufficient for
detecting a limited number of gestures; however, a more accurate judgment requires a larger number
of gestures in the gesture space to further validate this assertion.
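One common way to realise such template matching is normalized cross-correlation: score the input against each stored template and accept the best score above a threshold. The sketch below is a pure-Python illustration of that idea, with function names and the threshold value chosen here for illustration; it is not the LabVIEW Vision implementation actually used:

```python
# Score a candidate image (flattened to a pixel vector) against each stored
# template with normalized cross-correlation (NCC); report the best-matching
# label, or "NONE" when no template scores above the threshold.

import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length pixel vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0  # flat images correlate with nothing

def best_match(image, templates, threshold=0.8):
    """Return the label of the best-matching template, or 'NONE'."""
    scores = {label: ncc(image, t) for label, t in templates.items()}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "NONE"
```

A perfect match scores 1.0; images unlike every template fall below the threshold and yield "NONE", matching the behaviour described earlier.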
The gesture recognition technique introduced in this article can be used with a variety of
front-end input systems, such as vision-based input, hand and eye tracking, digital tablets, mice,
and digital gloves. Much previous work has focused on isolated sign language recognition with
clear pauses after each sign, although the research focus is slowly shifting to continuous
recognition. These pauses make isolated recognition a much easier problem than continuous
recognition, because explicit segmentation of a continuous input stream into individual signs is
very difficult. For this reason, and because of co-articulation effects, work on isolated
recognition often does not generalize easily to continuous recognition.
The proposed software, in contrast, captures the input as an AVI sequence of continuous
images. This allows continuous image acquisition without pauses, while each image frame
is still processed individually and checked for a pattern match. The technique thus handles a
pause-free input stream while processing the continuous images one frame at a time.
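The frame-by-frame idea can be sketched as follows, with classify_frame standing in for the pattern-matching step (both function names are invented here for illustration):

```python
# Each frame of a continuous stream is classified independently, so no
# explicit segmentation of the stream into isolated signs is required.

def classify_frame(frame):
    # Stand-in for the pattern-matching step: here, frames equal to
    # "sign" are treated as a match and everything else as no match.
    return "A" if frame == "sign" else "NONE"

def process_stream(frames):
    """Classify every frame of a continuous, pause-free stream."""
    return [classify_frame(f) for f in frames]
```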
For speech to sign language translation, words of similar pronunciation are sometimes
misinterpreted. This problem can be mitigated by pronouncing the words clearly, and it
diminishes with extended training and increased usage.
Figure 4.16 Database of sign templates (alphabets A to Z)
Figure 4.17 Database of sign number templates (numbers 1 to 9)
5 CONCLUSIONS AND FUTURE ENHANCEMENT
5.1 CONCLUSIONS
This sign language translator is able to translate alphabets (A-Z) and numbers (1-9), and all
the signs can be translated in real time. However, signs that are similar in posture and gesture to
another sign can be misinterpreted, decreasing the accuracy of the system. The current system has
been trained on only a very small database. Since there will always be variation in a signer's hand
posture or motion trajectory, the quality of the training database should be enhanced to ensure
that the system picks up the correct and significant characteristics of each individual sign and
thereby improves performance. A larger dataset will also allow further experiments on performance
in different environments; such a comparison would allow the robustness of the system in changing
environments to be measured tangibly and would provide training examples for a wider variety of
situations. Adaptive color models and improved tracking could also boost the performance of the
vision system.
Collaboration with Assistive Technology researchers and members of the Deaf
community for continued design work is in progress. The gesture recognition technology is
only one component of a larger system that we hope will one day be an active tool for the Deaf
community.
This project did not focus on facial expressions, although it is well known that facial
expressions convey an important part of sign languages. Facial expressions can be extracted, for
example, by tracking the signer's face; the most discriminative features can then be selected with
a dimensionality reduction method, and this cue fused into the recognition system.
This system can be implemented in many application areas, for example accessing
government websites for which no sign-language video clip is available, or filling out forms
when no interpreter is present to help.
For future work, there are many possible improvements that can extend this work. First,
more diversified hand samples from different people can be used in the training process so that
the system will be more user-independent. A second improvement could be context awareness
for the gesture recognition system: the same gesture performed in different contexts and
environments can have different semantic meanings. Another possible improvement is to track and
recognize multiple objects, such as human faces, eye gaze and hand gestures, at the same time.
With this multi-modal tracking and recognition strategy, the relationships and interactions among
the tracked objects can be defined and assigned different semantic meanings so that a richer
command set can be covered. By integrating this richer command set with other communication
modalities, such as speech recognition and haptic feedback, the Deaf user's communication
experience can be greatly enriched and made much more engaging.
The system developed in this work can be extended to many other research topics in
computer vision and sign language translation. We hope this project will trigger further
investigations that make translation systems see and think better.
5.2 FUTURE ENHANCEMENT
5.2.1 APPLICATIONS OF SIGN RECOGNITION
Sign language recognition can be used to help Deaf persons interact efficiently with
non-signers without the intervention of an interpreter. It can be installed at government
organizations and other public services, and it can be integrated with the Internet for live video
conferencing between deaf and hearing people.
5.2.2 APPLICATIONS OF SPEECH RECOGNITION
There are a number of scenarios where speech recognition is being delivered,
developed, researched or seriously discussed. As with many contemporary technologies, such
as the Internet, online payment systems and mobile phone functionality, development is at least
partially driven by the trio of often-perceived evils.
• COMPUTER AND VIDEO GAMES
Speech input has been used in a limited number of computer and video games, on a variety of
PC and console platforms, over the past decade. For example, the game Seaman involved
growing and controlling strange half-man, half-fish characters in a virtual aquarium. A microphone,
sold with the game, allowed the player to issue one of a predetermined list of command words and
questions to the fish. The accuracy of interpretation in use seemed variable; during gaming
sessions, colleagues with strong accents had to speak in an exaggerated and slower manner for
the game to understand their commands.
Microphone-based games are available for two of the three main video game consoles
(PlayStation 2 and Xbox). However, these games primarily use speech in an online player-to-player
manner rather than interpreting spoken words electronically. For example, MotoGP for the
Xbox allows online players to ride against each other in a motorbike racing simulation and speak
(via microphone headset) to the nearest players (bikers) in the race. There is currently interest in,
but less development of, video games that interpret speech.
• PRECISION SURGERY
Developments in keyhole and micro surgery have clearly shown that minimizing invasive
or non-essential surgery increases success rates and shortens patient recovery times.
There is occasional speculation in various medical fora regarding the use of speech recognition in
precision surgery, where a procedure is partially or totally carried out by automated means.
For example, in removing a tumour or blockage without damaging surrounding tissue, a
command could be given to make an incision of a precise and small length, e.g. 2 millimeters.
However, the legal implications of such technology are a formidable barrier to significant
development in this area: if speech were incorrectly interpreted and, for example, a limb were
accidentally severed, who would be liable: the surgeon, the surgery system developers, or the
speech recognition software developers?
• DOMESTIC APPLICATIONS
There is, inevitably, interest in the use of speech recognition in domestic appliances such as
ovens, refrigerators, dishwashers and washing machines. One school of thought is that, as with
the use of speech recognition in cars, this can reduce the number of parts and therefore the
production cost of the machine. However, removing the normal buttons and controls would present
problems for people who, for physical or learning reasons, cannot use speech recognition systems.
• WEARABLE COMPUTERS
Perhaps the most futuristic application is in the use and functionality of wearable computers,
i.e. unobtrusive devices that can be worn like a watch or even embedded in clothing.
These would allow people to go about their everyday lives while storing information (thoughts,
notes, to-do lists) verbally, or communicating via email, phone or videophone, through wearable
devices. Crucially, this would be done without having to interact with the device, or even
remember that it is there; the user would just speak, and the device would know what to do with the
speech and carry out the appropriate task.
The rapid miniaturization of computing devices, the rapid rise in processing power, and
advances in mobile wireless technologies are making such devices more feasible. Significant
problems remain, such as background noise and the idiosyncrasies of an individual's language.
However, it is speculated that reliable versions of such devices will become commercially
available during this decade.
REFERENCES
[1] Andreas Domingo, Rini Akmeliawati, Kuang Ye Chow, ‘Pattern Matching for Automatic Sign
Language Translation System using LabVIEW’, International Conference on Intelligent and
Advanced Systems, 2007.
[2] Beifang Yi, ‘A Framework for a Sign Language Interfacing System’, Ph.D. dissertation
(supervisor: Dr. Frederick C. Harris), University of Nevada, Reno, May 2006.
[3] Fernando López-Colino, José Colás, ‘Spanish Sign Language synthesis system’,
Journal of Visual Languages and Computing 23 (2012) 121–136.
[4] Helene Brashear, Thad Starner, ‘Using Multiple Sensors for Mobile Sign Language
Recognition’, Wearable Computing Laboratory, ETH - Swiss Federal Institute of Technology,
8092 Zurich, Switzerland.
[5] Jose L. Hernandez-Rebollar, Nicholas Kyriakopoulos, Robert W. Lindeman, ‘A New
Instrumented Approach for Translating American Sign Language into Sound and Text’,
Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture
Recognition (FGR’04), IEEE, 2004.
[6] K. Abe, H. Saito, S. Ozawa, ‘Virtual 3D Interface System via Hand Motion Recognition from
Two Cameras’, IEEE Trans. Systems, Man, and Cybernetics, Vol. 32, No. 4, pp. 536–540, July
2002.
[7] Paschaloudi N. Vassilia, Margaritis G. Konstantinos, ‘"Listening to deaf": A Greek sign
language translator’, IEEE, 2006.
[8] Rini Akmeliawati, Melanie Po-Leen Ooi, Ye Chow Kuang, ‘Real-Time Malaysian Sign
Language Translation using Colour Segmentation and Neural Network’, IMTC 2007 -
Instrumentation and Measurement Technology Conference, Warsaw, Poland, 1–3 May 2007.
[9] R. Bowden, D. Windridge, T. Kabir, A. Zisserman, M. Brady, ‘A Linguistic Feature Vector
for the Visual Interpretation of Sign Language’, Proceedings of ECCV 2004, the 8th European
Conference on Computer Vision, Vol. 1, pp. 391–401, Prague, Czech Republic, 2004.
[10] Ravikiran J, Kavi Mahesh, Suhas Mahishi, Dheeraj R, Sudheender S, Nitin V Pujari,
‘Finger Detection for Sign Language Recognition’, Proceedings of the International
MultiConference of Engineers and Computer Scientists 2009 (IMECS 2009), Vol. I, Hong Kong,
18–20 March 2009.
[11] S. Akyol, U. Canzler, K. Bengler, W. Hahn, ‘Gesture Control for Use in Automobiles’,
Proc. IAPR Workshop on Machine Vision Applications, pp. 349–352, Tokyo, Japan, Nov. 2000.
[12] Verónica López-Ludeña, Rubén San-Segundo, Juan Manuel Montero, Ricardo Córdoba,
Javier Ferreiros, José Manuel Pardo, ‘Automatic categorization for improving Spanish into
Spanish Sign Language machine translation’, Computer Speech and Language 26 (2012) 149–167.
[13] SignSpeak: Scientific understanding and vision-based technological development for
continuous sign language recognition and translation, www.signspeak.eu, FP7-ICT-2007-3-231424.