YELLAPU MADHURI
AUTOMATIC LANGUAGE TRANSLATION SOFTWARE FOR AIDING COMMUNICATION BETWEEN INDIAN SIGN
LANGUAGE AND SPOKEN ENGLISH USING LABVIEW
A PROJECT REPORT
Submitted in partial fulfillment for the award of the degree of
MASTER OF TECHNOLOGY
in
BIOMEDICAL ENGINEERING
by
YELLAPU MADHURI (1651110002)
Under the Guidance of
Ms. G. ANITHA, (Assistant Professor)
DEPARTMENT OF BIOMEDICAL ENGINEERING
SCHOOL OF BIOENGINEERING FACULTY OF ENGINEERING & TECHNOLOGY
SRM UNIVERSITY (Under Section 3 of UGC Act, 1956) SRM Nagar, Kattankulathur-603203
Tamil Nadu, India
MAY 2013
ACKNOWLEDGEMENT
First and foremost, I express my heartfelt and deep sense of gratitude to our Chancellor
Shri. T. R. Pachamuthu, Shri. P. Ravi Chairman of the SRM Group of Educational Institutions,
Prof. P.Sathyanarayanan, President, SRM University, Dr. R.Shivakumar, Vice President, SRM
University, Dr. M.Ponnavaikko, Vice Chancellor, for providing me the necessary facilities for the
completion of my project. I also acknowledge Registrar Dr. N. Sethuraman for his constant support
and endorsement.
I wish to express my sincere gratitude to our Director (Engineering & Technology) Dr. C.
Muthamizhchelvan for his constant support and encouragement.
I am extremely grateful to the Head of the Department Dr. M. Anburajan for his invaluable
guidance, motivation, timely and insightful technical discussions. I am immensely grateful for his
constant encouragement and smooth approach throughout the project period, and for making this work
possible.
I am indebted to my Project Co-coordinators Mrs. U.Snekhalatha and Mrs. Varshini
Karthik for their valuable suggestions and motivation. I am deeply indebted to my Internal Guide
Ms. G. Anitha, and faculties of Department of Biomedical Engineering for extending their warm
support, constant encouragement and ideas they shared with us.
I would be failing in my duty if I did not acknowledge my family members and my friends
for their constant encouragement and support.
BONAFIDE CERTIFICATE
This is to certify that the Project entitled "AUTOMATIC LANGUAGE TRANSLATION
SOFTWARE FOR AIDING COMMUNICATION BETWEEN INDIAN SIGN LANGUAGE
AND SPOKEN ENGLISH USING LABVIEW" has been carried out by YELLAPU
MADHURI-1651110002 under the supervision of Ms. G. Anitha in partial fulfillment of the
degree of MASTER OF TECHNOLOGY in Biomedical Engineering, School of Bioengineering,
SRM University, during the academic year 2012-2013 (Project Work Phase-II, Semester-IV). The
contents of this report, in full or in parts, have not been submitted to any institute or university for
the award of any degree or diploma.
Signature                                      Signature
HEAD OF THE DEPARTMENT                         INTERNAL GUIDE
(Dr. M. Anburajan)                             (Ms. G. Anitha)
Department of Biomedical Engineering,          Department of Biomedical Engineering,
SRM University,                                SRM University,
Kattankulathur – 603 203.                      Kattankulathur – 603 203.

INTERNAL EXAMINER                              EXTERNAL EXAMINER
ABSTRACT
This report presents sign language translation software for the automatic translation
of Indian Sign Language into spoken English and vice versa, to assist communication between
speech- and/or hearing-impaired people and hearing people. It can be used by the Deaf community as
a translator when dealing with people who do not understand sign language, avoiding the
intervention of an intermediate person for interpretation and allowing communication in each
party's natural way of speaking. The proposed software is a standalone executable interactive
application developed using LabVIEW that can be implemented on any standard Windows
laptop or desktop, or on an iOS mobile phone, operating with the camera, processor and audio device.
For sign-to-speech translation, the one-handed sign gestures of the user are captured using the
camera; vision analysis functions are performed in the operating system and the
corresponding speech output is provided through the audio device. For speech-to-sign translation,
the speech input of the user is acquired by the microphone; speech analysis functions are performed
and the sign gesture picture corresponding to the speech input is displayed. The lag time
experienced during translation is small because of parallel processing, which allows
near-instantaneous translation from finger and hand movements to speech and from speech input to
sign language gestures. The system is trained to translate one-handed sign representations of
alphabets (A-Z) and numbers (1-9) to speech, and 165 word phrases to sign gestures. The training
database of inputs can be easily extended to expand the system's applications. The software does
not require the user to wear any special hand gloves. The results are found to be highly consistent
and reproducible, with fairly high precision and accuracy.
TABLE OF CONTENTS
CHAPTER NO. TITLE PAGE NO.
ABSTRACT I
LIST OF FIGURES IV
LIST OF ABBREVIATIONS VI
1 INTRODUCTION 1
1.1 HEARING IMPAIRMENT 2
1.2 NEED FOR THE SYSTEM 7
1.3 AVAILABLE MODELS 8
1.4 PROBLEM DEFINITION 8
1.5 SCOPE OF THE PROJECT 9
1.6 FUTURE PROSPECTS 10
1.7 ORGANISATION OF REPORT 10
2 AIM AND OBJECTIVES
OF THE PROJECT 11
2.1 AIM 11
2.2 OBJECTIVES 11
3 MATERIALS AND METHODOLOGY 12
3.1 SIGN LANGUAGE TO SPOKEN 13
ENGLISH TRANSLATION
3.2 SPEECH TO SIGN LANGUAGE 21
TRANSLATOR
4 RESULTS AND DISCUSSIONS 32
4.1 RESULTS 32
4.2 DISCUSSIONS 43
5 CONCLUSIONS AND FUTURE ENHANCEMENTS 48
5.1 CONCLUSIONS 48
5.2 FUTURE ENHANCEMENT 49
REFERENCES 52
LIST OF FIGURES
FIGURE PAGE NO.
1.1 Anatomy of human ear 3
1.2 Events involved in hearing 3
1.3 Speech chain 4
1.4 Block diagram of speech chain 4
3.1 Graphical abstract 12
3.2 Flow diagram of template preparation 16
3.3 Flow diagram of pattern matching 19
3.4 Block diagram of sign to speech translation 21
3.5 Flow diagram of speech to sign translation 25
3.6 Block diagram of speech to sign translation 24
3.7 Speech recognizer tutorial window 25
4.1 Application Installer 32
4.2 Application window 33
4.3 GUI of Speech to Sign translation 34
4.4 Speech recognizer in sleep mode 36
4.5 Speech recognizer in active mode 36
4.6 Speech recognizer when input speech is not clear for recognition 36
4.7 GUI of working window of speech to sign translation 37
4.8 Block diagram of speech to sign translation 37
4.9 GUI of template preparation 38
4.10 Block diagram of sign to speech translation 38
4.11 GUI of working window of template preparation 39
4.12 GUI of sign to speech translation 40
4.13 GUI of working window of sign to speech translation 41
4.14 Block diagram of sign to speech translation 42
4.15 Block diagram of pattern matching 42
4.16 Data base of sign templates 46
4.17 Data base of sign number templates 47
LIST OF ABBREVIATIONS
Sr.No ABBREVIATION EXPANSION
1 SL Sign language
2 BII Bahasa Isyarat India
3 SLT Sign language translator
4 ASLR Automatic sign language recognition
5 ASLT Automatic sign language translation
6 GSL Greek Sign Language
7 SDK Software development kit
8 RGB Red green blue
9 USB Universal serial bus
10 CCD Charge-coupled device
11 ASL American sign language
12 ASR Automatic sign recognition
13 HMM Hidden Markov model
14 LM Language model
15 OOV Out of vocabulary
1. INTRODUCTION
In India there are around 60 million people with hearing deficiencies. Deafness brings about
significant communication problems: most deaf people have serious difficulty expressing
themselves in spoken or written language or understanding written texts. This fact can cause deaf people to
have problems accessing information, education, employment, social relationships, culture, etc. It is
necessary to distinguish between “deaf” and “Deaf”: the first refers to non-hearing
people, and the second one refers to non-hearing people who use a sign language to communicate
between themselves (their mother tongue), making them part of the “Deaf community”. Sign
language is a language through which communication is possible without the means of acoustic
sounds. Instead, sign language relies on sign patterns, i.e., body language, orientation and
movements of the arm to facilitate understanding between people. It exploits unique features of the
visual medium through spatial grammar. Sign languages are fully-fledged languages that have a
grammar and lexicon just like any spoken language, contrary to what most people think. The use of
sign languages defines the Deaf as a linguistic minority, with learning skills, cultural and group
rights similar to other minority language communities.
Hand gestures can be used for natural and intuitive human-computer interaction for
translating sign language to spoken language to assist communication of deaf community with non
sign language users. To achieve this goal, computers should be able to recognize hand gestures
from input. Vision-based gesture recognition can achieve an improved interaction, more intuitive
and flexible for the user. However, vision-based hand tracking and gesture recognition is an
extremely challenging problem due to the complexity of hand gestures, which are rich in diversity
owing to the many degrees of freedom of the human hand. At the same time, computer vision
algorithms are notoriously brittle and computation intensive, which make most current gesture
recognition systems fragile and inefficient. This report proposes a new architecture to solve the
problem of real-time vision-based hand tracking and gesture recognition. To recognize different
hand postures, a parallel cascades structure is implemented. This structure achieves real-time
performance and high translation accuracy. The 2D position of the hand is recovered according to
the camera’s perspective projection. To make the system robust against cluttered backgrounds,
background subtraction and noise removal are applied. The overall goal of this project is to develop
a new vision-based technology for recognizing and translating continuous sign language to spoken
English and vice-versa.
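As a simple illustration of the perspective-projection idea mentioned above, a pinhole-camera model maps a 3-D point in the camera frame to a 2-D image position. This is a hedged sketch only; the focal length and coordinates below are arbitrary assumptions, not parameters from this project.

```python
def project_to_image(x: float, y: float, z: float, f: float = 800.0):
    """Pinhole (perspective) projection of a camera-frame point (x, y, z),
    z > 0, onto the image plane: u = f*x/z, v = f*y/z (in pixels)."""
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (f * x / z, f * y / z)

# A hand centre 0.5 m to the right at 2 m depth maps to u = 800*0.5/2 = 200 px.
print(project_to_image(0.5, 0.0, 2.0))  # (200.0, 0.0)
```

The 2-D hand position recovered this way is what the recognition layer groups over time into gesture classes.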
1.1 HEARING IMPAIRMENT
Hearing is one of the major senses and is important for distant warning and communication.
It can be used to alert, to communicate pleasure and fear. It is a conscious appreciation of vibration
perceived as sound. In order to do this, the appropriate signal must reach the higher parts of the
brain. The function of the ear is to convert physical vibration into an encoded nervous impulse. It
can be thought of as a biological microphone. Like a microphone the ear is stimulated by vibration:
in the microphone the vibration is transduced into an electrical signal, in the ear into a nervous
impulse which in turn is then processed by the central auditory pathways of the brain. The
mechanism to achieve this is complex.
The ears are paired organs, one on each side of the head with the sense organ itself, which is
technically known as the cochlea, deeply buried within the temporal bones. Part of the ear is
concerned with conducting sound to the cochlea; the cochlea is concerned with transducing
vibration. The transduction is performed by delicate hair cells which, when stimulated, initiate a
nervous impulse. Because they are living, they are bathed in body fluid which provides them with
energy, nutrients and oxygen. Most sound is transmitted by a vibration of air. Vibration is poorly
transmitted at the interface between two media which differ greatly in characteristic impedance.
The ear has evolved a complex mechanism to overcome this impedance mismatch, known as the
sound conducting mechanism. The sound conducting mechanism is divided into two parts: an outer
part which catches sound, and the middle ear which acts as an impedance
matching device. Sound waves can be distinguished from each other by means of the differences in
their frequencies and amplitudes. For people suffering from any type of deafness, these differences
cease to exist. The anatomy of the ear and the events involved in hearing process are shown in
figure 1.1 and figure 1.2 respectively.
Figure 1.1 Anatomy of human ear
Figure 1.2 Events involved in hearing
Figure 1.3 Speech chain
Figure 1.4 Block diagram of speech chain
1.1.1 THE SPEECH SIGNAL
While you are producing speech sounds, the air flow from your lungs first passes the glottis
and then your throat and mouth. Depending on which speech sound you articulate, the speech
signal can be excited in three possible ways:
• VOICED EXCITATION
The glottis is closed. The air pressure forces the glottis to open and close periodically thus
generating a periodic pulse train (triangle-shaped). This "fundamental frequency" usually lies in
the range from 80 Hz to 350 Hz.
• UNVOICED EXCITATION
The glottis is open and the air passes a narrow passage in the throat or mouth. This results in
a turbulence which generates a noise signal. The spectral shape of the noise is determined by the
location of the narrowness.
• TRANSIENT EXCITATION
A closure in the throat or mouth will raise the air pressure. By suddenly opening the closure
the air pressure drops immediately ("plosive burst"). With some speech sounds these three
kinds of excitation occur in combination. The spectral shape of the speech signal is determined by
the shape of the vocal tract (the pipe formed by your throat, tongue, teeth and lips). By changing
the shape of the pipe (and in addition opening and closing the air flow through your nose) you
change the spectral shape of the speech signal, thus articulating different speech sounds.
An engineer looking at (or listening to) a speech signal might characterize it as follows:
• The bandwidth of the signal is 4 kHz
• The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz
• There are peaks in the spectral distribution of energy at (2n − 1) · 500 Hz, n = 1, 2, 3, . . . (1.1)
• The envelope of the power spectrum of the signal shows a decrease with increasing frequency
(-6 dB per octave).
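The engineer's characterization above can be written as a small sketch, using the values stated in the text (the formant-peak relation of equation (1.1) and the -6 dB per octave roll-off); this is purely illustrative and is not part of the report's software.

```python
import math

def formant_peak_hz(n: int) -> float:
    """Frequency of the n-th spectral energy peak, per equation (1.1):
    (2n - 1) * 500 Hz."""
    return (2 * n - 1) * 500.0

def envelope_drop_db(f_hz: float, f_ref_hz: float = 500.0) -> float:
    """Spectral-envelope attenuation relative to f_ref_hz, assuming the
    stated roll-off of -6 dB per octave."""
    return -6.0 * math.log2(f_hz / f_ref_hz)

print([formant_peak_hz(n) for n in (1, 2, 3)])  # [500.0, 1500.0, 2500.0]
print(envelope_drop_db(1000.0))                 # -6.0 (one octave above 500 Hz)
```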
1.1.2 CAUSES OF DEAFNESS IN HUMANS
Many speech and sound disorders occur without a known cause. Some speech-sound errors
can result from physical problems such as
• Developmental disorders
• Genetic syndromes
• Hearing loss
• Illness
• Neurological disorders
Some of the major types are listed below.
• Genetic Hearing Loss
• Conductive Hearing Loss
• Perceptive Hearing Loss
• Pre-Lingual Deafness
• Post-Lingual Deafness
• Unilateral Hearing Loss
1. In some cases, hearing loss or deafness is due to hereditary factors. Genetics is considered to
play a major role in the occurrence of sensory neural hearing loss. Congenital deafness can
happen due to heredity or birth defects.
2. Causes of Human deafness include continuous exposure to loud noises. This is commonly
observed in people working in construction sites, airports and nightclubs. This is also
experienced by people working with firearms and heavy equipment, and those who use music
headphones frequently. The longer the exposure, the greater is the chance of getting affected by
hearing loss and deafness.
3. Some diseases and disorders can also be a contributory factor for deafness in humans. This
includes measles, meningitis, some autoimmune diseases like Wegener's granulomatosis,
mumps, presbycusis, AIDS and chlamydia. Fetal alcohol syndrome, which develops in babies
born to alcoholic mothers, can cause hearing loss in infants. Growing adenoids can also cause
hearing loss by obstructing the Eustachian tube. Otosclerosis, which is a disorder of the middle
ear bone, is another cause of hearing loss and deafness. Likewise, there are many other medical
conditions which can cause deafness in humans.
4. Some medications are also considered to be the cause of permanent hearing loss in humans,
while others can lead to deafness which can be reversed. The former category includes
medicines like gentamicin and the latter includes NSAIDs, diuretics, aspirin and macrolide
antibiotics. Narcotic pain killer addiction and heavy hydrocodone abuse can also cause
deafness.
5. Causes of human deafness also include exposure to certain industrial chemicals. These ototoxic
chemicals can contribute to hearing loss if combined with continuous exposure to loud noise.
These chemicals can damage the cochlea and some parts of the auditory system.
6. Sometimes, loud explosions can cause deafness in humans. Head injury is another cause for
deafness in humans.
The above are some of the common causes of deafness in humans. There can be many other
reasons which can lead to deafness or hearing loss in humans. It is always advisable to protect the
ears from trauma and other injuries, and to wear protective gear in workplaces, where there are
continuous heavy noises.
1.2 NEED FOR THE SYSTEM
Deaf communities revolve around sign languages as they are their natural means of
communication. Although deaf, hard of hearing and hearing signers can communicate without
problems amongst themselves, there is a serious challenge for the deaf community in trying to
integrate into educational, social and work environments. An important problem is that there are
not enough sign-language interpreters. In India, around 60 million people have hearing
deficiencies, many of them Deaf sign-language users, but there are only about 7,000 sign-language
interpreters, i.e. thousands of deaf people for every interpreter. This information shows the need to
develop automatic translation systems with new technologies for helping hearing and Deaf people
to communicate between themselves.
1.3 AVAILABLE MODELS
Previous approaches have focused on recognizing mainly the hand alphabet which is used to
finger spell words and complete signs which are formed by dynamic hand movements. So far body
language and facial expressions have been left out. Hand gesture recognition can be achieved in
two ways: video-based and instrumented.
The video based systems allow the signer to move freely without any instrumentation
attached to the body. The hand shape, location and movement are recognized by cameras. But the
signer is constrained to sign in a controlled environment. The amount of data to be processed in the
image imposes a restriction on memory, speed and complexity on the computer equipment.
Instrumented approaches require sensors to be placed on the signer's hands. They are
restrictive and cumbersome, but more successful at recognizing hand gestures than video-based
approaches.
1.4 PROBLEM DEFINITION
Sign language is very complex with many actions taking place both sequentially and
simultaneously. Existing translators are bulky, slow and not precise due to the heavy parallel
processing required. The cost of these translators is usually very high due to the hardware required
to meet the processing demands. There is an urgent requirement for a simple, precise and
inexpensive system that helps to bridge the gap between hearing people who do not know sign
language and deaf persons who communicate through sign language, who are unfortunately present
in significantly large numbers in a country such as India.
In this project, the aim is to detect single-hand gestures in two dimensional space, using a
vision based system and speech input through microphone. The selected features should be as
small as possible in number, invariant to input errors like vibrating hand, small rotation, scale,
pitch and voice which may vary from person to person or with different input devices and provide
audio output through speaker and visual output on display device. The acceptable delay of the
system is the end of each gesture, meaning that the pre-processing should be in real-time. One of
our goals in the design and development of this system is scalability in detecting a reasonable
number of gestures and words and the ability to add new gestures and words in the future.
1.5 SCOPE OF THE PROJECT
The developed software is a standalone application. It can be installed and run on any standard PC
or iOS phone. It can be used in a large variety of environments such as shops and governmental
offices, and also for communication between a deaf user and information systems such as vending
machines or PCs. The scope proposed for this project is outlined below.
For sign to speech translation:
i. To develop an image acquisition system that automatically acquires images when triggered, for a
fixed interval of time, or when gestures are present.
ii. To develop a set of gesture definitions and the associated filtration, effect and function
processes.
iii. To develop a pre-defined gesture algorithm that commands the computer to play back the
corresponding audio model.
iv. To develop a testing system that proceeds to the command if the condition holds for the
processed images.
v. To develop a simple Graphical User Interface for input and indication purposes.
For speech to sign translation:
i. To develop a speech acquisition system that automatically acquires speech input when triggered,
for a fixed interval of time, or when speech is present.
ii. To develop a set of phoneme definitions and the associated filtration, effect and function
processes.
iii. To develop a pre-defined phonetics algorithm that commands the computer to display the
corresponding sign model.
iv. To develop a testing system that proceeds to the command if the condition holds for the
processed phonemes.
v. To develop a simple Graphical User Interface for input and indication purposes.
1.6 FUTURE PROSPECTS
For sign language to spoken English translation, the software is able to translate only static
signs to spoken English currently. It can be extended to translate dynamic signs. Also facial
expressions and body language can be tracked and considered which improves the performance of
the sign language to spoken English translation. For spoken English to sign language translation,
the system can be made user voice specific to eliminate the system response to non user.
1.7 ORGANISATION OF REPORT
This report is composed of five chapters, each giving details on an aspect of the
project. Chapter 1 introduces the report and explains the foundation on which the system is built.
Chapter 2 states the aim and objectives of the project. Chapter 3 explains the materials and
methodology followed to achieve that aim and those objectives and to arrive at a complete
application; it starts with an overview of the key software and hardware components and how the
two cooperate, followed by a closer look at the overall system built. Chapter 4 presents the results
and a discussion of the system and its performance at the various stages of implementation. The
report is concluded in Chapter 5, which briefly discusses what the proposed system has
accomplished and provides an outlook on future work recommended for extending this project.
2. AIM AND OBJECTIVES OF THE PROJECT
2.1. AIM
To develop a mobile interactive application program for automatic translation of Indian sign
language into spoken English and vice-versa to assist the communication between Deaf people and
hearing people. The sign language translator should be able to translate one-handed Indian Sign
Language finger-spelling input of alphabets (A-Z) and numbers (1-9) to spoken English audio
output, and 165 spoken English words as input to Indian Sign Language picture display output.
2.2. OBJECTIVES
• To acquire one hand finger spelling of alphabets (A to Z) and numbers (1 to 9) to produce
spoken English audio output.
• To acquire spoken English word input to produce Indian Sign language picture display output.
• To create an executable file to make the software a standalone application.
• To implement the software and optimize the parameters to improve the accuracy of translation.
• To minimize hardware requirements and thus expense while achieving high precision of
translation.
3. MATERIALS AND METHODOLOGY
This chapter is dedicated to explaining the system in detail, from the setup, through the
system components, to the output. The software is developed on the LabVIEW virtual
instrumentation platform. It consists of two main parts, namely sign language to speech translation
and speech to sign language translation, and can be implemented using a standard laptop,
desktop or an iOS mobile phone to operate with the camera, processor and audio device.
Figure 3.1 Graphical abstract
The software consists of four modules that can be implemented from a single window. The
necessary steps to implement these modules from a single window are explained in detail below.
3.1 SIGN LANGUAGE TO SPOKEN ENGLISH TRANSLATION
The sign language to spoken English translation is achieved using pattern matching
technique. The complete interactive section can be considered to comprise two layers:
detection and recognition. The detection layer is responsible for defining and extracting visual
features that can be attributed to the presence of hands in the field of view of the camera. The
recognition layer is responsible for grouping the spatiotemporal data extracted in the previous
layers and assigning the resulting groups with labels associated to particular classes of gestures.
3.1.1 DETECTION
The primary step in gesture recognition systems is the detection of hands and the
corresponding image regions. This step is crucial because it isolates the task-relevant data from the
image background, before passing them to the subsequent tracking and recognition stages. A large
number of methods have been proposed in the literature that utilize several types of visual
features and, in many cases, their combination. Such features include skin color, shape, motion, and
anatomical models of the hand. Several color spaces have been proposed, including RGB,
normalized RGB, HSV, YCrCb, YUV, etc. Color spaces efficiently separating the chromaticity
from the luminance components of color are typically considered preferable. This is due to the fact
that by employing chromaticity-dependent components of color only, some degree of robustness to
illumination changes can be achieved.
Template-based detection is used in this work: the hand detector is invoked in the
spatial vicinity where the hand was detected in the previous frame, so as to drastically restrict the
image search space. The implicit assumption for this method to succeed is that images are acquired
frequently enough. The proposed technique is explained in the following intermediate steps
namely: image acquisition, image processing, template preparation and pattern recognition.
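As a hedged sketch of the chromaticity idea discussed above (the report itself implements detection in LabVIEW), the conversion below follows the standard ITU-R BT.601 YCrCb definition, and the skin-color bounds are illustrative assumptions, not values taken from this project:

```python
import numpy as np

def rgb_to_ycrcb(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 RGB image (0-255 values) to YCrCb, separating
    luminance (Y) from the chromaticity components (Cr, Cb)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128.0
    cb = (b - y) * 0.564 + 128.0
    return np.stack([y, cr, cb], axis=-1)

def skin_mask(rgb: np.ndarray) -> np.ndarray:
    """Threshold only the chromaticity planes (Cr, Cb), ignoring Y, so the
    mask gains some robustness to illumination changes. Bounds are
    illustrative assumptions."""
    ycrcb = rgb_to_ycrcb(rgb)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    return (cr > 135) & (cr < 180) & (cb > 85) & (cb < 135)
```

Because only Cr and Cb are thresholded, a brighter or darker version of the same skin tone produces a similar mask, which is the robustness property the text attributes to chromaticity-based color spaces.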
3.1.2 IMAGE ACQUISITION
The software is installed in any supporting operating system with access to camera and
microphone. After installing the executable file, follow the instructions that appear on the
Graphical User Interface (GUI) and execute the program. The program allows the user to choose
the camera. All cameras accessible through the operating system, whether inbuilt or externally
connected, appear in the selection list. After choosing the camera, the
software sends commands to the camera to capture the gestures of sign language performed by the
user. Image acquisition process is subjected to many environmental concerns such as the position
of the camera, lighting sensitivity and background condition. The camera is placed to focus on an
area that can capture the maximum possible movement of the hand and take into account the
difference in height of individual signers. Sufficient lighting is required to ensure that the acquired
image is bright enough to be seen and analyzed. Capturing thirty frames per second (fps) is found
to be sufficient. A higher fps would only lead to higher computation time, as there is more input
data to be processed. As the acquisition process runs in real time, this part of the process has to be
efficient. The acquired images are then processed. The previous frame that has been processed will
be automatically deleted to free the limited memory space in the buffer.
3.1.3 IMAGE PROCESSING
The captured images are processed to identify the unique features of each sign. Image
processing enhances the features of interest for recognition of the sign. The camera captures
images at 30 frames per second. At this rate, the difference between subsequent images will be too
small. Hence, the images are sampled at 5 frames per second. In the program, one frame is saved
and numbered sequentially every 200 milliseconds so that the image classifying and processing can
be done systematically. The position of the hand is monitored. The image acquisition runs
continuously until the acquisition is stopped. The image processing involves performing
morphological operations on the input images to enhance the unique features of each sign. As the
frames from acquisition are read one by one, they are subjected to extraction of single color plane
of luminance.
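A minimal sketch of the sampling and luminance-extraction steps described above (a Python/NumPy stand-in for the LabVIEW implementation; the 30 fps capture rate and 5 fps sampling rate come from the text, while the luminance weights are the standard BT.601 coefficients):

```python
import numpy as np

def sampled_frame_indices(total_frames: int,
                          capture_fps: int = 30,
                          sample_fps: int = 5) -> list:
    """Indices of the frames kept when downsampling the 30 fps capture
    stream to 5 fps, i.e. one saved frame every 200 ms."""
    step = capture_fps // sample_fps  # 6 captured frames between samples
    return list(range(0, total_frames, step))

def luminance_plane(rgb: np.ndarray) -> np.ndarray:
    """Extract the single luminance (Y) plane from an RGB frame, as done
    before pattern matching."""
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return np.round(y).astype(np.uint8)

print(sampled_frame_indices(30))  # [0, 6, 12, 18, 24] -> 5 frames per second
```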
3.1.4 TEMPLATE PREPARATION
The images to be used for pattern matching as templates are prepared using the following
procedure and are saved in a folder to be later used in the pattern matching.
1. Open Camera
Open the camera, query the camera for its capabilities, load the camera configuration file, and
create a unique reference to the camera.
2. Configure Acquisition
Configure a low-level acquisition previously opened with IMAQdx Open Camera VI. Specify
the acquisition type with the Continuous and Number of Buffers parameters. Snap: Continuous
= 0; Buffer Count = 1
Sequence: Continuous = 0; Buffer Count > 1
Grab: Continuous = 1; Buffer Count > 1
3. Start Acquisition
Start an acquisition that was previously configured with the IMAQdx Configure Acquisition.
4. Create
Create a temporary memory location for an image.
5. Get Image
Acquire the specified frame into Image Out. If the image type does not match the video format
of the camera, this VI changes the image type to a suitable format.
6. Extract Single Color Plane
Extract a single plane from the color image.
7. Setup Learn Pattern
Sets parameters used during the learning phase of pattern matching.
8. Learn Pattern
Create a description of the template image for which you want to search during the matching
phase of pattern matching. This description data is appended to the input template image.
During the matching phase, the template descriptor is extracted from the template image and
used to search for the template in the inspection image.
Figure 3.2 Flow diagram of template preparation
9. Write File 2
Write the image to a file in the selected format.
10. Close Camera
Stop the acquisition in progress, release resources associated with the acquisition, and close the
specified Camera Session.
11. Merge Errors Function
Merge error I/O clusters from different functions. This function looks for errors beginning with
the error in 0 parameter and reports the first error found. If the function finds no errors, it looks
for warnings and returns the first warning found. If the function finds no warnings, it returns no
error.
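The "Learn Pattern" idea in steps 7 and 8 can be sketched as precomputing a matching descriptor for each template. This is a simplified Python stand-in for illustration only, not the actual NI Vision implementation:

```python
import numpy as np

def learn_pattern(template: np.ndarray) -> dict:
    """Precompute a descriptor for a grayscale template so that the
    matching phase does not redo this work: zero-mean pixel values plus
    their norm, as needed by normalized cross-correlation."""
    t = template.astype(np.float64)
    t = t - t.mean()
    return {"pattern": t, "norm": float(np.linalg.norm(t))}
```

Appending this descriptor to the saved template image is what allows the matching phase to extract it later and search the inspection image without re-learning the template.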
3.1.5 IMAGE RECOGNITION
The last stage of sign language to spoken English translation is the recognition stage and
providing the audio output. The techniques used for feature extraction should find shapes reliably
and robustly irrespective of changes in illumination levels, position, orientation and size of the
object in a video. Objects in an image are represented as collections of pixels, and for object
recognition we need to describe the properties of these groups of pixels. The description of an
object is a set of numbers called the object's descriptors. Recognition is then simply a matter of
matching a set of shape descriptors against a set of known descriptors. A usable descriptor should
possess four valuable properties: the descriptors should form a complete set, be congruent, be
rotation invariant, and form a compact set. Objects in an image are characterized by two forms of
descriptors: region descriptors, which describe the arrangement of pixels within the object area,
and shape descriptors, which describe the arrangement of pixels along the object boundary.
Template matching, a fundamental pattern recognition technique, has been utilized for
gesture recognition. Template matching is performed by pixel-by-pixel comparison of a prototype
and a candidate image; the similarity of the candidate to the prototype is proportional to its total
score on a preselected similarity measure. For the recognition of hand postures, the image of a
detected hand forms the candidate image, which is compared directly with prototype images of
hand postures. The best matching prototype (if any) is taken as the recognized posture.
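The pixel-by-pixel comparison described above can be sketched in a few lines. The following Python fragment is only an illustration (the project itself uses LabVIEW's IMAQ pattern matching, not this code); the normalized cross-correlation measure and the 0.7 acceptance threshold are assumptions chosen for the example.

```python
import numpy as np

def match_score(template: np.ndarray, candidate: np.ndarray) -> float:
    """Normalized cross-correlation between two equal-sized grayscale images.

    Returns a similarity in [-1, 1]; 1.0 means a pixel-perfect match.
    """
    t = template.astype(float) - template.mean()
    c = candidate.astype(float) - candidate.mean()
    denom = np.sqrt((t ** 2).sum() * (c ** 2).sum())
    return float((t * c).sum() / denom) if denom else 0.0

def classify(candidate, prototypes):
    """Return the name of the best-matching prototype, or None below threshold."""
    scores = {name: match_score(p, candidate) for name, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.7 else None
```

The threshold rejects candidates that do not resemble any prototype closely enough, which is what lets the system answer "no match" instead of forcing a classification.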
The final stage of the system is the classification of different signs and the generation of voice
messages corresponding to each correctly classified sign. The acquired, preprocessed images are
read one by one and compared with the template images saved in the database for pattern
matching. The pattern matching parameter is a threshold on the maximum difference between the
input sign and the database entry: if the difference is below this limit, a match is found and the
sign is recognized. Ideally the threshold is set at 800. When an input image matches a template
image, the pattern matching loop stops, and the audio corresponding to that loop iteration value is
played through the built-in audio device. The necessary steps to achieve sign language to speech
translation are given below.
1. Invoke Node
Invoke a method or action on a reference.
2. Default Values: Reinitialize All To Default Method
Change the current values of all controls on the front panel to their defaults.
3. File Dialog Express
Displays a dialog box with which you can specify the path to a file or directory from existing
files or directories or to select a location and name for a new file or directory.
4. Open Camera
Opens a camera, queries the camera for its capabilities, loads a camera configuration file, and
creates a unique reference to the camera.
5. Configure Grab
Configures and starts a grab acquisition. A grab performs an acquisition that loops continually
on a ring of buffers.
6. Create
Create a temporary memory location for an image.
7. Grab
Acquire the most current frame into Image Out.
8. Extract Single Color Plane
Extract a single plane from a color image.
9. Setup Learn Pattern
Sets parameters used during the learning phase of pattern matching.
Figure 3.3 Flow diagram of pattern matching
10. Learn Pattern
Create a description of the template image for which you want to search during the matching
phase of pattern matching. This description data is appended to the input template image.
During the matching phase, the template descriptor is extracted from the template image and
used to search for the template in the inspection image.
11. Recursive File List
List the contents of a folder or LLB.
12. Unbundle By Name Function
Returns the cluster elements whose names you specify.
13. Read File
Read an image file. The file format can be a standard format (BMP, TIFF, JPEG, JPEG2000,
PNG, and AIPD) or a nonstandard format known to the user. In all cases, the read pixels are
converted automatically into the image type passed by Image.
14. Call Chain
Return the chain of callers from the current VI to the top-level VI. Element 0 of the call chain
array contains the name of the lowest VI in the call chain. Subsequent elements are callers of
the lower VIs in the call chain. The last element of the call chain array is the name of the top-
level VI.
15. Index Array
Return the element or sub-array of n-dimension array at index.
16. Format Into String
Formats string, path, enumerated type, time stamp, Boolean, or numeric data as text.
17. Read Image And Vision Info
Read an image file, including any extra vision information saved with the image. This includes
overlay information, pattern matching template information, calibration information, and
custom data, as written by the IMAQ Write Image and Vision Info File 2 instance of the IMAQ
Write File 2 VI.
18. Pattern Match Algorithm
Check for the presence of the template image in the given input image.
19. Speak Text
Call the .NET speech synthesizer to speak a string of text.
20. Dispose
Destroys an image and frees the space it occupied in memory. This VI is required for each
image created in an application to free the memory allocated to the IMAQ Create VI.
21. Simple Error Handler
Indicate whether an error occurred. If an error occurred, this VI returns a description of the
error and optionally displays a dialog box.
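The overall control flow that steps 1-21 implement can be summarized as a loop. The Python below is only a sketch of that flow, with hypothetical `grab_frame`, `match`, and `speak` helpers standing in for the LabVIEW VIs; it is not the project's actual implementation.

```python
# Sketch of the sign-to-speech loop, with hypothetical helpers standing in
# for the LabVIEW VIs: grab_frame() acquires a frame, match() returns a
# difference score, and speak() drives the .NET speech synthesizer.

def translate_loop(grab_frame, templates, match, speak, threshold=800):
    """Grab frames and compare each against every template; speak the first match.

    A difference score below `threshold` counts as a match, mirroring the
    report's maximum-difference setting of 800.
    """
    while True:
        frame = grab_frame()                       # acquire the current frame
        for label, template in templates.items():  # one template per iteration
            if match(template, frame) < threshold:
                speak(label)                       # audio output for the match
                return label
        # no template matched this frame; continue with the next one
```

In the LabVIEW version the loop iteration count selects a case in a case structure; here the dictionary key plays the same role of naming the matched sign.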
Figure 3.4 Block diagram of sign to speech translation
3.2 SPEECH TO SIGN LANGUAGE TRANSLATION
The speech input is acquired through the built-in microphone using Windows Speech
Recognition software. The system recognizes the speech input phrases that are listed in the
database; each phrase in the database is associated with a picture of a sign language gesture. If the
input speech matches the database, a command is sent to display the corresponding gesture. The
necessary steps to achieve speech to sign language translation are given below.
1. Current VI’s path
Return the path to the file of the current VI.
2. Strip Path
Return the name of the last component of a path and the stripped path that leads to that
component.
3. Build path
Create a new path by appending a name (or relative path) to an existing path.
4. VI Server Reference
Return a reference to the current VI or application, to a control or indicator in the VI, or to a
pane. You can use this reference to access the properties and methods for the associated VI,
application, control, indicator, or pane.
5. Property Node
Get (reads) and/or set (writes) properties of a reference. Use the property node to get or set
properties and methods on local or remote application instances, VIs, and objects.
6. Speech Recognizer Initialize
The event is raised when the current grammar has been used by the recognition engine to
detect speech and find one or more phrases with sufficient confidence levels.
7. Event Structure
Has one or more subdiagrams, or event cases, exactly one of which executes when the
structure executes. The Event structure waits until an event happens, then executes the
appropriate case to handle that event.
8. Read JPEG File VI
Read the JPEG file and create the data necessary to display the file in a picture control.
9. Draw Flattened Pixmap VI
Draw a 1-, 4-, or 8-bit pixmap or a 24-bit RGB pixmap into a picture.
10. 2D Picture Control
Include a set of drawing instructions for displaying pictures that can contain lines, circles, text,
and other types of graphic shapes.
11. Simple Error Handler VI
Indicate whether an error occurred. If an error occurred, this VI returns a description of the
error and optionally displays a dialog box.
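The lookup performed by steps 1-11 can be sketched as follows. This is an illustrative Python fragment, not the project's LabVIEW implementation; the file-naming convention (lower-case phrase plus `.jpg` inside the "Sign Images" folder) is an assumption for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical sketch of the speech-to-sign display logic: a phrase
# recognized by Windows Speech Recognition is mapped to a gesture picture.

def sign_for_phrase(phrase: str, image_dir: str = "Sign Images") -> Optional[Path]:
    """Return the gesture picture path for a recognized phrase, or None
    when the phrase is not in the database."""
    candidate = Path(image_dir) / (phrase.lower() + ".jpg")
    return candidate if candidate.exists() else None
```

Extending the vocabulary then amounts to dropping a new image into the folder and adding the phrase to the recognizer's list, which matches how the report describes extending the application.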
Figure 3.5 Flow diagram speech to sign translation
Figure 3.6 Block diagram speech to sign translation
3.2.1 SPEECH RECOGNITION
A speech recognition system consists of the following:
• A microphone, for the person to speak into.
• Speech recognition software.
• A computer to take and interpret the speech.
• A good quality soundcard for input and/or output.
Voice-recognition software works by analyzing sounds and converting them to text. It also
uses knowledge of how English is usually spoken to decide what the speaker most probably said.
Once correctly set up, such systems should recognize around 95% of what is said if you speak
clearly. Several programs are available that provide voice recognition. These systems have mostly
been designed for Windows operating systems; however, programs are also available for Mac OS
X. In addition to third-party software, there are also voice-recognition programs built in to the
operating systems of Windows Vista and Windows 7. Most specialist voice applications include
the software, a microphone headset, a manual and a quick reference card. You connect the
microphone to the computer, either into the soundcard (sockets on the back of a computer) or via a
USB or similar connection. The latest versions of Microsoft Windows have a built-in voice-
recognition program called Speech Recognition. It does not have as many features as Dragon
NaturallySpeaking but does have good recognition rates and is easy to use. As it is part of the
Windows operating system, it does not require any additional cost apart from a microphone.
The input voice recognition is achieved using the Windows 7 built-in Speech Recognition
software. When the program is started, instructions appear for setting up the microphone, and a
tutorial begins that walks the user through voice recognition.
A computer doesn't speak your language, so it must transform your words into something it
can understand. A microphone converts your voice into an analog signal and feeds it to your PC's
sound card. An analog-to-digital converter takes the signal and converts it to a stream of digital
data (ones and zeros). Then the software goes to work. While each of the leading speech
recognition companies has its own proprietary methods, the two primary components of speech
recognition are common across products. The first piece, called the acoustic model, analyzes the
sounds of your voice and converts them to phonemes, the basic elements of speech. The English
language contains approximately 50 phonemes.
Here's how it breaks down your voice: First, the acoustic model removes noise and unneeded
information such as changes in volume. Then, using mathematical calculations, it reduces the data
to a spectrum of frequencies (the pitches of the sounds), analyzes the data, and converts the words
into digital representations of phonemes. The software operation is as explained below.
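The noise-removal and frequency-reduction steps above can be sketched for a single short frame of audio. This is an illustrative fragment only, not the engine's actual code; frame length and windowing are assumptions for the example.

```python
import numpy as np

# Illustrative sketch of the acoustic front end: normalize away volume
# changes, window the frame, then reduce it to a spectrum of frequencies.

def frame_spectrum(frame: np.ndarray) -> np.ndarray:
    """Return the magnitude spectrum of one short speech frame."""
    peak = np.abs(frame).max()
    if peak > 0:
        frame = frame / peak                    # remove changes in volume
    windowed = frame * np.hanning(len(frame))   # taper edges before the FFT
    return np.abs(np.fft.rfft(windowed))        # pitches present in the sound
```

Each such spectrum describes the sound heard during that slice of time, which is the representation the acoustic model goes on to convert into phonemes.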
Figure 3.7 Speech recognizer tutorial window
3.2.2 STEPS TO VOICE RECOGNITION
• ENROLMENT
Everybody’s voice sounds slightly different, so the first step in using a voice-recognition
system involves reading an article displayed on the screen. This process, called enrolment, takes
less than 10 minutes and results in a set of files being created which tell the software how you
speak. Many of the newer voice-recognition programs say this is not required, however it is still
worth doing to get the best results. The enrolment only has to be done once, after which the
software can be started as needed.
• DICTATING AND CORRECTING
When talking, people often hesitate, mumble or slur their words. One of the key skills in
using voice-recognition software is learning how to talk clearly so that the computer can recognize
what you are saying. This means planning what to say and then speaking in complete phrases or
sentences. The voice-recognition software will misunderstand some of the words spoken, so it is
necessary to proofread and then correct any mistakes. Corrections can be made by using the mouse
and keyboard or by using your voice. When you make corrections, the voice-recognition software
will adapt and learn, so that (hopefully) the same mistake will not occur again. Accuracy should
improve with careful dictation and correction.
• INPUT
The first step in voice recognition (VR) is the input and digitization of the voice into VR-
capable software. This generally happens via an active microphone plugged into the computer. The
user speaks into the microphone, and an analog-to-digital converter (ADC) creates digital sound
files for the VR program to work with.
• ANALYSIS
The key to VR is in the speech analysis. VR programs take the digital recording and parse it
into small, recognizable speech bits called "phonemes," via high-level audio analysis software.
(There are approximately 40 of these in the English language.)
• SPEECH-TO-TEXT
Once the program has identified the phonemes, it begins a complex process of identification
and contextual analysis, comparing each string of recorded phonemes against text equivalents in its
memory. It then accesses its internal language database and pairs up the recorded phonemes with
the most probable text equivalents.
• OUTPUT
Finally, the VR software provides a word output to the screen, mere moments after speaking. It
continues this process, at high speed, for each word spoken into its program. Speech recognition
fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio
from a sound card into recognized speech. The elements of the pipeline are:
1. Transform the PCM digital audio into a better acoustic representation.
2. Apply a "grammar" so the speech recognizer knows what phonemes to expect. A grammar
could be anything from a context-free grammar to full-blown Language.
3. Figure out which phonemes are spoken.
4. Convert the phonemes into words.
• TRANSFORM THE PCM DIGITAL AUDIO
The first element of the pipeline converts digital audio coming from the sound card into a
format that's more representative of what a person hears. The wave format can vary. In other
words, it may be 16 kHz 8-bit mono/stereo or 8 kHz 16-bit mono, and so forth. It's a wavy line
that periodically repeats while the user is speaking. When in this form, the data isn't useful to
speech recognition because it's too difficult to identify any patterns that correlate to what was
actually said. To make pattern recognition easier, the PCM digital audio is transformed into the
"frequency domain." Transformations are done using a windowed Fast-Fourier Transform (FFT).
The output is similar to what a spectrograph produces. In frequency domain, you can identify the
frequency components of a sound. From the frequency components, it's possible to approximate
how the human ear perceives the sound.
The FFT analyzes every 1/100th of a second and converts the audio data into the frequency
domain. Each 1/100th of second's results are a graph of the amplitudes of frequency components,
describing the sound heard for that 1/100th of a second. The speech recognizer has a database of
several thousand such graphs (called a codebook) that identify different types of sounds the human
voice can make. The sound is "identified" by matching it to its closest entry in the codebook,
producing a number that describes the sound. This number is called the "feature number."
(Actually, there are several feature numbers generated for every 1/100th of a second, but the
process is easier to explain assuming only one.) The input to the speech recognizer began as a
stream of 16,000 PCM values per second. By using Fast-Fourier Transforms and the codebook, it is
boiled down into essential information, producing 100 feature numbers per second.
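The codebook step described above can be sketched as a nearest-neighbour lookup. The fragment below is purely illustrative; a real recognizer's codebook holds several thousand learned entries, not the tiny hand-made one assumed here.

```python
import numpy as np

# Sketch of the codebook step: every 1/100 s spectrum is replaced by the
# index of its nearest codebook entry -- the "feature number".

def feature_number(spectrum: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the codebook entry closest to this spectrum."""
    distances = np.linalg.norm(codebook - spectrum, axis=1)
    return int(distances.argmin())
```

Running this once per 1/100th of a second turns the 16,000-samples-per-second stream into the 100 feature numbers per second that the rest of the recognizer works with.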
• FIGURE OUT WHICH PHONEMES ARE SPOKEN
To figure out which phonemes are spoken the following procedure is used.
• Start by grouping. To make the recognition process easier to understand, you first should know
how the recognizer determines what phonemes were spoken and then understand the grammars.
• Every time a user speaks a word, it sounds different. Users do not produce exactly the same
sound for the same phoneme.
• The background noise from the microphone and user's office sometimes causes the recognizer
to hear a different vector than it would have if the user were in a quiet room with a high-quality
microphone.
• The sound of a phoneme changes depending on what phonemes surround it. The "t" in "talk"
sounds different than the "t" in "attack" and "mist."
The sound produced by a phoneme changes from the beginning to the end of the phoneme,
and is not constant. The beginning of a "t" will produce different feature numbers than the end of
a "t." The background noise and variability problems are solved by allowing a feature number to be
used by more than just one phoneme, and using statistical models to figure out which phoneme is
spoken. This can be done because a phoneme lasts for a relatively long time, 50 to 100 feature
numbers, and it's likely that one or more sounds are predominant during that time. Hence, it's
possible to predict what phoneme was spoken.
The speech recognizer needs to know when one phoneme ends and the next begins. Speech
recognition engines use a mathematical technique called "Hidden Markov Models" (HMMs) that
figure this out. The speech recognizer figures out when speech starts and stops because it has a
"silence" phoneme, and each feature number has a probability of appearing in silence, just like any
other phoneme. Now, the recognizer can recognize what phoneme was spoken if there's
background noise or the user's voice had some variation. However, there's another problem. The
sound of phonemes changes depending upon what phoneme came before and after. You can hear
this with words such as "he" and "how". You don't speak a "h" followed by an "ee" or "ow," but the
vowels intrude into the "h," so the "h" in "he" has a bit of "ee" in it, and the "h" in "how" has a bit
of "ow" in it.
Speech recognition engines solve the problem by creating "tri-phones," which are phonemes
in the context of surrounding phonemes. Thus, there's a tri-phone for "silence-h-ee" and one for
"silence-h-ow." Because there are roughly 50 phonemes in English, you can calculate that there are
50*50*50 = 125,000 tri-phones. That's just too many for current PCs to deal with, so similar-
sounding tri-phones are grouped together.
The sound of a phoneme is not constant. A "t" sound is silent at first, then produces a
sudden burst of high-frequency noise, which then fades to silence. Speech recognizers solve this by
splitting each phoneme into several segments and generating a different model for each segment.
The recognizer figures out where each segment begins and ends in the same way it figures out
where a phoneme begins and ends.
A speech recognizer works by hypothesizing a number of different "states" at once. Each
state contains a phoneme with a history of previous phonemes. The hypothesized state with the
highest score is used as the final recognition result.
When the speech recognizer starts listening, it has one hypothesized state. It assumes the user
isn't speaking and that the recognizer is hearing the "silence" phoneme. Every 1/100th of a second,
it hypothesizes that the user has started speaking, and adds a new state per phoneme, creating 50
new states, each with a score associated with it. After the first 1/100th of a second, the recognizer
has 51 hypothesized states.
In 1/100th of a second, another feature number comes in. The scores of the existing states are
recalculated with the new feature. Then, each phoneme has a chance of transitioning to yet another
phoneme, so 51 * 50 = 2550 new states are created. The score of each state is the score of the first
1/100th of a second times the score of the 2nd 1/100th of a second. After 2/100ths of a second, the
recognizer has 2601 hypothesized states.
This same process is repeated every 1/100th of a second. The score of each new hypothesis is
the score of its parent hypothesis times the score derived from the new 1/100th of a second. In the
end, the hypothesis with the best score is what's used as the recognition result.
• ADAPTATION
Speech recognition systems "adapt" to the user's voice, vocabulary, and speaking style to
improve accuracy. A system that has had enough time to adapt to an individual can have one fourth
the error rate of a speaker-independent system. Adaptation works because the speech recognizer is
often informed (directly or indirectly) by the user whether its recognition was correct and, if not,
what the correct recognition is.
The recognizer can adapt to the speaker's voice and variations of phoneme pronunciations in
a number of ways. First, it can gradually adapt the codebook vectors used to calculate the acoustic
feature number. Second, it can adapt the probability that a feature number will appear in a
phoneme. Both of these are done by weighted averaging.
The language model also can be adapted in a number of ways. The recognizer can learn new
words, and slowly increase probabilities of word sequences so that commonly used word sequences
are expected. Both these techniques are useful for learning names.
Although not common, the speech recognizer can adapt word pronunciations in its lexicon.
Each word in a lexicon typically has one pronunciation. The word "orange" might be pronounced
like "or-anj." However, users will sometimes speak "ornj" or "or-enj." The recognizer can
algorithmically generate hypothetical alternative pronunciations for a word. It then listens for all of
these pronunciations during standard recognition, "or-anj," "or-enj," "or-inj," and "ornj." During
the process of recognition, one of these pronunciations will be heard, although there's a fair chance
that the recognizer heard a different pronunciation than what the user spoke. However, after the
user has spoken the word a number of times, the recognizer will have enough examples that it can
determine what pronunciation the user spoke.
However, speech recognition (by a machine) is a very complex problem. Vocalizations vary
in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed.
Speech is also distorted by background noise, echoes, and the electrical characteristics of the
recording equipment. Accuracy of speech recognition varies with the following:
speech recognition varies with the following:
• Vocabulary size and confusability
• Speaker dependence vs. independence
• Isolated, discontinuous, or continuous speech
• Task and language constraints
• Read vs. spontaneous speech
• Adverse conditions
4 RESULTS AND DISCUSSIONS
4.1 RESULTS
In this section we will analyze the performance of the system by its capability to recognize
gestures from images. We also discuss the difficulties faced while designing the system.
4.1.1 APPLICATION
The software is a standalone application. To install it, follow the instructions that appear in
the executable installer file.
Figure 4.1 Application Installer
After installing the application, a graphical user interface (GUI) window opens from which
the full application can be run. The GUI has been created to run the entire application from a
single window. It has four pages, namely page 1, page 2, page 3 and page 4; each page
corresponds to a specific application.
• Page 1 gives a detailed demo of the total software usage.
• Page 2 is for speech to sign language translation.
• Page 3 is for template preparation for sign to speech translation.
• Finally page 4 is for Sign to speech translation.
Figure 4.2 Application window
The functions of the various buttons that appear on the window are as explained below.
To run the application
To stop the application
To go to previous page
To go to next page
4.1.1.1 PAGE 1
This page consists of detailed instructions to execute the entire application. To continue to
the specific application use the previous and next buttons.
4.1.1.2 PAGE 2
This page consists of Speech to Sign language translator. The window appearance is as
shown in figure 4.3. The working of this module is as explained below.
Figure 4.3 GUI of Speech to Sign translation
Building a speech recognition program is not in the scope of this project; instead, an existing
speech recognition engine is integrated into the program. When the “Start” button is pressed, a
command is sent to the Windows 7 built-in Speech Recognizer and it opens a mini window at the
top. The first time it is started, a tutorial session begins which gives instructions to set up the
microphone and recognize the user’s voice input. For the application to take full advantage of
speech recognition, the speech recognition program must be correctly configured: the microphone
and language settings must be set appropriately to make optimal use of the speech recognition
program’s capabilities.
Voice recognition training teaches the software to recognize your voice and speech patterns.
Training involves reading the given paragraphs or single words into the software using a
microphone. The more you repeat the process, the more accurately the program should transcribe
your speech.
According to a Landmark College article, most people get frustrated with the training process
and feel it's too time consuming. Before you decide to skip training, you should think about the
consequences. The software will incorrectly transcribe your speech more often than not, which will
make the software less efficient.
Speaking clearly and succinctly during training makes it easier for the software to recognize
your voice. As a result, you'll spend less time training, repeating yourself and correcting the
program. It also helps to use a good-quality microphone that easily registers your voice. The
speech recognition package also tunes itself to the individual user: the software customizes itself
based on your voice, your unique speech patterns, and your accent. To improve dictation accuracy,
it creates a supplementary dictionary of the words you use.
After the initial training, the program starts speech recognition automatically each time it is
executed. To train the system for a different user or change the microphone settings, right-click on
the Speech Recognizer window and select “Start Speech Tutorial”.
To stop the speech recognition software select the icon or say “Stop listening”. The
Speech recognizer will go to sleep mode.
Figure 4.4 Speech recognizer in sleep mode.
To start speech recognition again select the icon or say “Start Listening”. The Speech
recognizer will go to active mode.
Figure 4.5 Speech recognizer in active mode.
If the user’s speech input is not clear, the recognizer asks the user to repeat the input.
Figure 4.6 Speech recognizer when input speech is not clear for recognition.
The active working mode appearance of the Speech to Sign language translator module
window is as shown in figure 4.7. When the user utters any of the words listed in the “Phrases”
near the microphone, the input sound is processed for recognition. If the input sound matches the
words in the database, it is displayed in the “Command” alphanumeric indicator. A sign language
gesture picture corresponding to the speech input is displayed in the “Sign” picture indicator. Also
the score of speech input correlation with the trained word is displayed in the “Score” numeric
indicator. Use the exit button to exit the application of speech to sign language translation. To
extend the application to translate more input spoken English words to Sign language picture
display output, simply include the sign language images in the folder “Sign Images” and add the
word to the list in the “Phrases”.
Figure 4.7 GUI of working window of speech to sign translation
Figure 4.8 Block diagram of speech to sign translation
4.1.1.3 PAGE 3
Figure 4.9 GUI of template preparation
Figure 4.10 Block diagram of sign to speech translation
This page consists of template preparation setup for Sign language to Speech translator. The
window appearance is as shown in figure 4.11. The working of this module is as explained below.
To execute the template preparation module for Sign language to speech translation, press the
“Start” button. Choose the camera to acquire images to be used as templates, from the “Camera
Name” list. The acquired image is displayed on the “Image” picture indicator. If the displayed
image is suitable for preparing a template, press “Snap frame”. The snapped image is displayed on
“Snap Image” picture display. Draw a region of interest to prepare the template and press “Learn”.
The image region in the selected portion of the snapped frame is saved to the folder specified for
templates. The saved template image is displayed on “Template Image” picture display. Press
“Stop” button to stop execution of template preparation module.
Figure 4.11 GUI of working window of template preparation
4.1.1.4 PAGE 4
This page consists of the Sign to Speech translator. When started, it captures the signs performed
by the deaf user in real time, compares them with the created template images and gives an audio
output when a match is found. The window appearance is as shown in figure 4.12. The working
of this module is as explained below.
Figure 4.12 GUI of sign to speech translation
Press the “Start” button to start the program. The “Camera Name” indicator displays the list of
all the cameras that are connected to the computer. Choose the camera from the list. Adjust the
selected camera position to capture the sign gestures performed by the user. For the performed test
the camera is fixed at a distance of one meter from the user’s hand. The captured images are displayed
on the “Input Image” picture display. Press the “Match” button to start comparing the acquired input
image with the template images in the database. In every iteration, the input image is checked for a
pattern match against one template. When the input image matches the template image, the loop
halts. The “Match” LED glows and the matched template is displayed on the “Template Image”
indicator. If the input image does not match any of the images from the database of templates,
the audio output says “NONE” and the “Match” LED does not glow.
The loop iteration count is used for triggering a case structure. Depending on the iteration
count value, a specific case is selected and gives a string output. Otherwise the loop continues to the
next iteration, where the input image is checked for a pattern match with a new template. The information in
the string output from case structure is displayed on the “Matched Pattern” alphanumeric indicator. It
also initiates the .NET speech synthesizer to give an audio output through the speaker.
Figure 4.13 GUI of working window of sign to speech translation
To pause the pattern matching while the program is still running, press the “Match” button.
This puts the pattern matching step into inactive mode: the acquired image is displayed on the
Input Image indicator but does not go through pattern matching. To resume pattern matching,
press the “Match” button again; it is highlighted, indicating that it is in active mode.
Figure 4.14 Block diagram of sign to speech translation
Figure 4.15 Block diagram of pattern matching
For sign language to spoken English translation, the classification of different gestures is
done using pattern matching technique for 36 different gestures (Alphabets A to Z and numbers 1
to 9) of Indian sign language. The performance of the system is evaluated based on its ability to
correctly recognize signs to their corresponding speech class. The recognition rate is defined as the
ratio of the number of correctly classified signs to the total number of signs:
Recognition Rate (%) = (Number of Correctly Classified Signs / Total Number of Signs) × 100
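As a worked example of the recognition-rate formula, the computation is a single ratio:

```python
def recognition_rate(correct: int, total: int) -> float:
    """Recognition rate in percent: correctly classified signs over total signs."""
    return 100.0 * correct / total
```

With all 36 test signs correctly classified, the rate is 100%, matching the result reported below; 27 correct out of 36 would give 75%.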
The proposed approach has been assessed using input sequences containing a user performing
various gestures in an indoor environment for alphabets A to Z and numbers 1 to 9. This section
presents results obtained from a sequence depicting a person performing a variety of hand gestures
in a setup that is typical for deaf and normal person interaction applications, i.e. the subject is sitting
at a typical distance of about 1 m from the camera. The resolution of the sequence is 640 × 480 and
it was obtained with a standard, low-end web camera at 30 frames per second.
The total number of signs used for testing is 36, and the system recognition rate is 100% for
inputs similar to the database templates. The system was implemented in LabVIEW version 2012.
4.2 DISCUSSIONS
For sign language to speech translation, the gesture recognition problem consists of pattern
representation and recognition. In previous related work, the hidden Markov model (HMM), widely
used in speech recognition, has been applied by a number of researchers to temporal gesture
recognition. Yang and Xu (1994) proposed gesture-based interaction using a multi-dimensional
HMM. They used a Fast Fourier Transform (FFT) to convert input gestures into a sequence of
symbols for training the HMM, and reported 99.78% accuracy in detecting 9 gestures.
Watanabe and Yachida (1998) proposed a method of gesture recognition from image
sequences. The input image is segmented using maskable templates, and the gesture space is then
constituted by Karhunen-Loeve (KL) expansion over the segments. They applied eigenvector-based
matching for gesture detection.
Oka, Sato and Koike (2002) developed gesture recognition based on measured finger
trajectories for an augmented desk interface system. They used a Kalman filter to predict the
locations of multiple fingertips and an HMM for gesture detection, reporting an average
accuracy of 99.2% for single-finger gestures produced by one person. Ogawara et al. (2001)
proposed a method of constructing a human task model by attention point (AP) analysis; their
target application was gesture recognition for human-robot interaction.
New et al. (2003) proposed a gesture recognition system that tracks the hand and detects the
number of fingers being held up to control an external device, based on hand-shape template
matching. Perrin et al. (2004) described a finger-tracking gesture recognition system based on a
laser tracking mechanism that can be used in hand-held devices. They used an HMM for
gesture recognition, with an accuracy of 95% for 5 gesture symbols at a distance of 30 cm from
their device.
Lementec and Bajcsy (2004) proposed an arm gesture recognition algorithm using Euler
angles acquired from multiple orientation sensors, for controlling unmanned aerial vehicles in the
presence of manned aircrew. Dias et al. (2004) described a vision-based open gesture
recognition engine called OGRE, reporting detection and tracking of hand contours using template
matching with an accuracy of 80% to 90%.
Because of the difficulty of collecting data to train an HMM for temporal gesture
recognition, the vocabularies are very limited, and reaching acceptable accuracy is excessively
data- and time-intensive. Some researchers have suggested that a better approach is
needed for more complex systems (Perrin et al., 2004).
This work presents a novel approach to gesture detection with two main
steps: i) gesture template preparation, and ii) gesture detection. The gesture template preparation
technique presented here has several features important for gesture recognition, including
robustness against slight rotation, a small number of required features, invariance to the start
position, and device independence. For gesture detection, a pattern-matching technique is used. The
results of our first experiment show 99.72% average accuracy in single gesture detection. Given the
high accuracy of the gesture classification, the number of templates appears sufficient for
detecting a limited number of gestures; however, a more accurate judgment requires a larger number
of gestures in the gesture space to further validate this assertion.
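One common way to realise such template matching is normalized cross-correlation: score the input against each stored template and accept the best score above a threshold. The sketch below is a pure-Python illustration of that idea, with function names and the threshold value chosen here for illustration; it is not the LabVIEW Vision implementation actually used:

```python
# Score a candidate image (flattened to a pixel vector) against each stored
# template with normalized cross-correlation (NCC); report the best-matching
# label, or "NONE" when no template scores above the threshold.

import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length pixel vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0  # flat images correlate with nothing

def best_match(image, templates, threshold=0.8):
    """Return the label of the best-matching template, or 'NONE'."""
    scores = {label: ncc(image, t) for label, t in templates.items()}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "NONE"
```

A perfect match scores 1.0; images unlike every template fall below the threshold and yield "NONE", matching the behaviour described earlier.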
The gesture recognition technique introduced in this article can be used with a variety of
front-end input systems, such as vision-based input, hand and eye tracking, digital tablets, mice,
and digital gloves. Much previous work has focused on isolated sign language recognition with
clear pauses after each sign, although the research focus is slowly shifting to continuous
recognition. These pauses make isolated recognition a much easier problem than continuous
recognition, because explicit segmentation of a continuous input stream into individual signs is
very difficult. For this reason, and because of co-articulation effects, work on isolated
recognition often does not generalize easily to continuous recognition.
The proposed software, in contrast, captures the input as an AVI sequence of continuous
images. This allows continuous image acquisition without pauses, while each image frame
is still processed individually and checked for a pattern match. The technique thus handles a
pause-free input stream while processing the continuous images one frame at a time.
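The frame-by-frame idea can be sketched as follows, with classify_frame standing in for the pattern-matching step (both function names are invented here for illustration):

```python
# Each frame of a continuous stream is classified independently, so no
# explicit segmentation of the stream into isolated signs is required.

def classify_frame(frame):
    # Stand-in for the pattern-matching step: here, frames equal to
    # "sign" are treated as a match and everything else as no match.
    return "A" if frame == "sign" else "NONE"

def process_stream(frames):
    """Classify every frame of a continuous, pause-free stream."""
    return [classify_frame(f) for f in frames]
```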
For speech to sign language translation, words of similar pronunciation are sometimes
misinterpreted. This problem can be mitigated by pronouncing the words clearly, and it
diminishes with extended training and increased usage.
Figure 4.16 Database of sign templates (alphabets A to Z)
Figure 4.17 Database of sign number templates (numbers 1 to 9)
5 CONCLUSIONS AND FUTURE ENHANCEMENT
5.1 CONCLUSIONS
This sign language translator is able to translate alphabets (A-Z) and numbers (1-9), and all
the signs can be translated in real time. However, signs that are similar in posture and gesture to
another sign can be misinterpreted, decreasing the accuracy of the system. The current system has
been trained on only a very small database. Since there will always be variation in a signer's hand
posture or motion trajectory, the quality of the training database should be enhanced to ensure
that the system picks up the correct and significant characteristics of each individual sign and
thereby improves performance. A larger dataset will also allow further experiments on performance
in different environments; such a comparison would allow the robustness of the system in changing
environments to be measured tangibly and would provide training examples for a wider variety of
situations. Adaptive color models and improved tracking could also boost the performance of the
vision system.
Collaboration with Assistive Technology researchers and members of the Deaf
community for continued design work is in progress. The gesture recognition technology is
only one component of a larger system that we hope will one day be an active tool for the Deaf
community.
This project did not focus on facial expressions, although it is well known that facial
expressions convey an important part of sign languages. Facial expressions can be extracted, for
example, by tracking the signer's face; the most discriminative features can then be selected with
a dimensionality reduction method, and this cue fused into the recognition system.
This system can be implemented in many application areas, for example accessing
government websites for which no sign-language video clip is available, or filling out forms
when no interpreter is present to help.
For future work, there are many possible improvements that can extend this work. First,
more diversified hand samples from different people can be used in the training process so that
the system will be more user-independent. A second improvement could be context awareness
for the gesture recognition system: the same gesture performed in different contexts and
environments can have different semantic meanings. Another possible improvement is to track and
recognize multiple objects, such as human faces, eye gaze and hand gestures, at the same time.
With this multi-modal tracking and recognition strategy, the relationships and interactions among
the tracked objects can be defined and assigned different semantic meanings so that a richer
command set can be covered. By integrating this richer command set with other communication
modalities, such as speech recognition and haptic feedback, the Deaf user's communication
experience can be greatly enriched and made much more engaging.
The system developed in this work can be extended to many other research topics in
computer vision and sign language translation. We hope this project will trigger further
investigations that make translation systems see and think better.
5.2 FUTURE ENHANCEMENT
5.2.1 APPLICATIONS OF SIGN RECOGNITION
Sign language recognition can be used to help Deaf persons interact efficiently with
non-signers without the intervention of an interpreter. It can be installed at government
organizations and other public services, and it can be integrated with the Internet for live video
conferencing between deaf and hearing people.
5.2.2 APPLICATIONS OF SPEECH RECOGNITION
There are a number of scenarios where speech recognition is being delivered,
developed, researched or seriously discussed. As with many contemporary technologies, such
as the Internet, online payment systems and mobile phone functionality, development is at least
partially driven by the trio of often-perceived evils.
• COMPUTER AND VIDEO GAMES
Speech input has been used in a limited number of computer and video games, on a variety of
PC and console platforms, over the past decade. For example, the game Seaman involved
growing and controlling strange half-man, half-fish characters in a virtual aquarium. A microphone,
sold with the game, allowed the player to issue one of a predetermined list of command words and
questions to the fish. The accuracy of interpretation in use seemed variable; during gaming
sessions, colleagues with strong accents had to speak in an exaggerated and slower manner for
the game to understand their commands.
Microphone-based games are available for two of the three main video game consoles
(PlayStation 2 and Xbox). However, these games primarily use speech in an online player-to-player
manner rather than interpreting spoken words electronically. For example, MotoGP for the
Xbox allows online players to ride against each other in a motorbike racing simulation and speak
(via microphone headset) to the nearest players (bikers) in the race. There is currently interest in,
but less development of, video games that interpret speech.
• PRECISION SURGERY
Developments in keyhole and micro surgery have clearly shown that minimizing invasive
or non-essential surgery increases success rates and shortens patient recovery times.
There is occasional speculation in various medical fora regarding the use of speech recognition in
precision surgery, where a procedure is partially or totally carried out by automated means.
For example, in removing a tumour or blockage without damaging surrounding tissue, a
command could be given to make an incision of a precise and small length, e.g. 2 millimeters.
However, the legal implications of such technology are a formidable barrier to significant
development in this area: if speech were incorrectly interpreted and, for example, a limb were
accidentally severed, who would be liable: the surgeon, the surgery system developers, or the
speech recognition software developers?
• DOMESTIC APPLICATIONS
There is, inevitably, interest in the use of speech recognition in domestic appliances such as
ovens, refrigerators, dishwashers and washing machines. One school of thought is that, as with
the use of speech recognition in cars, this can reduce the number of parts and therefore the
production cost of the machine. However, removing the normal buttons and controls would present
problems for people who, for physical or learning reasons, cannot use speech recognition systems.
• WEARABLE COMPUTERS
Perhaps the most futuristic application is in the use and functionality of wearable computers,
i.e. unobtrusive devices that can be worn like a watch or even embedded in clothing.
These would allow people to go about their everyday lives while storing information (thoughts,
notes, to-do lists) verbally, or communicating via email, phone or videophone, through wearable
devices. Crucially, this would be done without having to interact with the device, or even
remember that it is there; the user would just speak, and the device would know what to do with the
speech and carry out the appropriate task.
The rapid miniaturization of computing devices, the rapid rise in processing power, and
advances in mobile wireless technologies are making such devices more feasible. Significant
problems remain, such as background noise and the idiosyncrasies of an individual's language.
However, it is speculated that reliable versions of such devices will become commercially
available during this decade.
REFERENCES
[1] Andreas Domingo, Rini Akmeliawati, Kuang Ye Chow, ‘Pattern Matching for Automatic Sign
Language Translation System using LabVIEW’, International Conference on Intelligent and
Advanced Systems, 2007.
[2] Beifang Yi, ‘A Framework for a Sign Language Interfacing System’, Ph.D. dissertation
(supervisor: Dr. Frederick C. Harris), University of Nevada, Reno, May 2006.
[3] Fernando López-Colino, José Colás, ‘Spanish Sign Language synthesis system’,
Journal of Visual Languages and Computing 23 (2012) 121–136.
[4] Helene Brashear, Thad Starner, ‘Using Multiple Sensors for Mobile Sign Language
Recognition’, Wearable Computing Laboratory, ETH - Swiss Federal Institute of Technology,
8092 Zurich, Switzerland.
[5] Jose L. Hernandez-Rebollar, Nicholas Kyriakopoulos, Robert W. Lindeman, ‘A New
Instrumented Approach for Translating American Sign Language into Sound and Text’,
Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture
Recognition (FGR’04), IEEE, 2004.
[6] K. Abe, H. Saito, S. Ozawa, ‘Virtual 3D Interface System via Hand Motion Recognition from
Two Cameras’, IEEE Trans. Systems, Man, and Cybernetics, Vol. 32, No. 4, pp. 536–540, July
2002.
[7] Paschaloudi N. Vassilia, Margaritis G. Konstantinos, ‘"Listening to deaf": A Greek sign
language translator’, IEEE, 2006.
[8] Rini Akmeliawati, Melanie Po-Leen Ooi, Ye Chow Kuang, ‘Real-Time Malaysian Sign
Language Translation using Colour Segmentation and Neural Network’, IMTC 2007 -
Instrumentation and Measurement Technology Conference, Warsaw, Poland, 1–3 May 2007.
[9] R. Bowden, D. Windridge, T. Kabir, A. Zisserman, M. Brady, ‘A Linguistic Feature Vector
for the Visual Interpretation of Sign Language’, Proceedings of ECCV 2004, the 8th European
Conference on Computer Vision, Vol. 1, pp. 391–401, Prague, Czech Republic, 2004.
[10] Ravikiran J, Kavi Mahesh, Suhas Mahishi, Dheeraj R, Sudheender S, Nitin V Pujari,
‘Finger Detection for Sign Language Recognition’, Proceedings of the International
MultiConference of Engineers and Computer Scientists 2009 (IMECS 2009), Vol. I, Hong Kong,
18–20 March 2009.
[11] S. Akyol, U. Canzler, K. Bengler, W. Hahn, ‘Gesture Control for Use in Automobiles’,
Proc. IAPR Workshop on Machine Vision Applications, pp. 349–352, Tokyo, Japan, Nov. 2000.
[12] Verónica López-Ludeña, Rubén San-Segundo, Juan Manuel Montero, Ricardo Córdoba,
Javier Ferreiros, José Manuel Pardo, ‘Automatic categorization for improving Spanish into
Spanish Sign Language machine translation’, Computer Speech and Language 26 (2012) 149–167.
[13] SignSpeak: Scientific understanding and vision-based technological development for
continuous sign language recognition and translation, www.signspeak.eu, FP7-ICT-2007-3-231424.