By: Khalid El-Darymli (G0327887)
Supervisor: Dr. Othman O. Khalifa
International Islamic University Malaysia
Kulliyyah of Engineering, ECE Dept.
Speech to Sign Language Interpreter System
OUTLINE
- Problem statement.
- Research goal and objectives.
- Main parts of our system.
- The structure of ASR:
  – SP,
  – Training: AM, Dictionary and LM,
  – and Decoding: the Viterbi beam search.
- Sign Language, ASL and the ASL alphabet.
- Signed English.
- Demo of ASL in our SW.
- Milestone.
Problem Statement
There is no free software, let alone one with a reasonable price, to convert speech into sign language in live mode.
Only one software product is commercially available to convert uttered speech in live mode to a video sign language.
This software is called iCommunicator, and to purchase it a deaf person has to pay USD 6,499!
IS IT FAIR?
RESEARCH GOAL AND OBJECTIVES
- Design and development of a Speech to Sign Language Interpreter System.
- The SW is open source and freely available, which will benefit the deaf community.
- To fill the gap between deaf and non-deaf people in two senses: first, by using this SW for educational purposes for deaf people, and second, by facilitating communication between deaf and non-deaf people.
- To increase the independence and self-confidence of the deaf person.
- To increase opportunities for advancement and success in education, employment, personal relationships, and public access venues.
- To improve quality of life.
Main Parts of the Speech to Sign Language Interpreter System

Continuous input speech → Speech-Recognition Engine → Recognized text → ASL Translation, backed by a database of pre-recorded ASL video clips.
Automatic Speech Recognition (ASR):
SR systems are clustered according to three categories: isolated vs. continuous, speaker-dependent vs. speaker-independent, and small vs. large vocabulary.
The expected task of our software entails a large-vocabulary, speaker-independent, continuous speech recognizer.

Input voice → SR Engine → Recognized text.
The Structure of the SR Engine (LVCSR)

Input audio → Signal Processing → feature vectors X = {x1, x2, …, xT}.

TRAINING produces three knowledge sources:
- AM: P(A1, …, AT | P1, …, Pk)
- Dictionary: P(P1, P2, …, Pk | W)
- LM: P(Wn | W1, …, Wn−1)

DECODING: for the hypotheses H = {W1, W2, …, Wk}, the decoder performs hypothesis evaluation by scoring P(X | W)·P(W) and returns the best hypothesis WBEST.
SIGNAL PROCESSING (FRONT-END):

Speech waveform x[n] (16-bit integer data) → Pre-emphasis → y[n] → Framing → yt'[n] → Windowing → yt[n] → Power Spectrum Calculation → St[k] → Mel Filterbank, ln| |² → St[m] → IDFT → 13 ct[n]

Pre-emphasis:
y[n] = x[n] − α·x[n−1]
where α is the pre-emphasis parameter.
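The pre-emphasis filter above can be sketched as follows; the default α = 0.97 is a common choice in the literature, not a value stated on the slide:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1].

    alpha is the pre-emphasis parameter; 0.97 is an assumed default,
    not a value given on the slide.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]   # first-order high-pass difference
    return y
```

This boosts the high-frequency content of the waveform before framing.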
MFCC computation:
The MFCC is a representation defined as the real cepstrum of a windowed short-time signal derived from the FFT of that signal.
MFCC computation consists of performing the inverse DFT on the logarithm of the magnitude of the filterbank output:

ct[n] = Σ (m = 0 … M−1) ln{St[m]} · cos(π·n·(m + 1/2)/M),   n = 0, 1, …, M−1

TYPICALLY FOR SPEECH RECOGNITION ONLY THE FIRST 13 COEFFICIENTS ARE USED.
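The cepstral step can be sketched directly from the formula; the function below assumes the mel filterbank outputs St[m] are given and keeps the first 13 coefficients:

```python
import numpy as np

def mfcc_from_mel(mel_spectrum, n_ceps=13):
    """Compute c[n] = sum_{m=0}^{M-1} ln(S[m]) * cos(pi*n*(m + 1/2)/M)
    and keep the first n_ceps coefficients (13 for speech recognition)."""
    S = np.asarray(mel_spectrum, dtype=float)
    M = len(S)
    n = np.arange(M)[:, None]        # cepstral index
    m = np.arange(M)[None, :]        # filterbank index
    basis = np.cos(np.pi * n * (m + 0.5) / M)
    c = basis @ np.log(S)            # inverse DFT of the log mel spectrum
    return c[:n_ceps]
```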
Framing and Windowing
Typical frame duration in speech recognition is 10 ms, while typical window duration is 25 ms.

Frames are extracted with frame shift Q:
yt'[n] = y[Q·t + n],   n = 0, …, N−1,  t = 1, …, T

Each frame is weighted by a Hamming window:
w[n] = 0.54 − 0.46·cos(2π·n/(N−1)),   n = 0, …, N−1;  0 otherwise

yt[n] = w[n]·yt'[n]
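A minimal sketch of the framing and windowing equations above, assuming the frame length N and frame shift Q are given in samples:

```python
import numpy as np

def frame_and_window(y, frame_len, frame_shift):
    """Split y into overlapping frames y_t'[n] = y[Q*t + n] and apply the
    Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame.

    frame_len (N) corresponds to ~25 ms and frame_shift (Q) to ~10 ms
    at the chosen sampling rate.
    """
    y = np.asarray(y, dtype=float)
    N, Q = frame_len, frame_shift
    T = 1 + (len(y) - N) // Q                       # number of full frames
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    frames = np.stack([y[t * Q : t * Q + N] for t in range(T)])
    return frames * w                               # y_t[n] = w[n] * y_t'[n]
```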
The mel filterbank:
It is used to extract spectral features of speech by properly integrating the spectrum over defined frequency ranges. The transfer function of the triangular mel-weighting filters Hm[k] is given by:

Hm[k] =
  0                                                   for k < f[m−1]
  2(k − f[m−1]) / ((f[m+1] − f[m−1])(f[m] − f[m−1]))  for f[m−1] ≤ k ≤ f[m]
  2(f[m+1] − k) / ((f[m+1] − f[m−1])(f[m+1] − f[m]))  for f[m] ≤ k ≤ f[m+1]
  0                                                   for k > f[m+1]

where f[m] are the boundary points of the filters.

The mel-spectrum of the power spectrum is computed by:
St[m] = Σ (k = 0 … N−1) Hm[k]·St[k],   m = 0, 1, …, M−1

where k is the DFT domain index, N is the length of the DFT, and M is the total number of triangular mel-weighting filters.
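The triangular filterbank can be sketched from the piecewise definition above; the boundary bins f[0..M+1] are assumed here to be precomputed DFT bin indices (the mel-scale spacing of those bins is omitted):

```python
import numpy as np

def triangular_filterbank(f, n_fft):
    """Build M triangular filters H_m[k] from boundary bins f[0..M+1],
    following the piecewise transfer function on the slide."""
    M = len(f) - 2
    H = np.zeros((M, n_fft))
    k = np.arange(n_fft)
    for m in range(1, M + 1):
        left, center, right = f[m - 1], f[m], f[m + 1]
        rising = (k >= left) & (k <= center)
        falling = (k > center) & (k <= right)
        denom = right - left
        H[m - 1, rising] = 2 * (k[rising] - left) / (denom * (center - left))
        H[m - 1, falling] = 2 * (right - k[falling]) / (denom * (right - center))
    return H

def mel_spectrum(S_t, H):
    """S_t[m] = sum_k H[m, k] * S_t[k]."""
    return H @ S_t
```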
Power Spectrum
The STFT is calculated using:
Yt(e^jω) = Σn y[Q·t + n]·w[n]·e^(−jωn)

To reduce computational complexity, Yt(e^jω) is evaluated only for a discrete number of ω values, ω = 2πk/N; the DFT of all frames of the signal is then obtained:
Yt[k] = Yt(e^(j2πk/N)),   k = 0, …, N−1

The phase information of the DFT samples of each frame is discarded.
The final output of this stage is:
St[k] = (real(Yt[k]))² + (imag(Yt[k]))²
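This stage reduces to one FFT per windowed frame with the phase discarded; a minimal sketch:

```python
import numpy as np

def power_spectrum(frames, n_fft):
    """Compute S_t[k] = real(Y_t[k])^2 + imag(Y_t[k])^2 per frame,
    discarding the DFT phase information."""
    Y = np.fft.fft(frames, n=n_fft, axis=-1)   # Y_t[k], k = 0..N-1
    return Y.real**2 + Y.imag**2               # = |Y_t[k]|^2
```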
Delta and Double-Delta computation
First- and second-order differences may be used to capture the dynamic evolution of the signal.
The first-order delta MFCC is computed from:
Δct = ct+1 − ct−1
The second-order delta MFCC is computed from:
ΔΔct = Δct+1 − Δct−1
The final output of the FE processing comprises a 39-feature vector (observation vector Xt) per processed frame.
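A sketch of the delta and double-delta differences; the handling of edge frames is not specified on the slides, so repeating the boundary frame is an assumption here:

```python
import numpy as np

def add_deltas(c):
    """Given MFCC frames c (shape T x 13), append first-order deltas
    delta_c_t = c_{t+1} - c_{t-1} and second-order deltas
    delta2_c_t = delta_c_{t+1} - delta_c_{t-1}, yielding T x 39 features.
    Boundary frames are repeated at the edges (an assumption)."""
    c = np.asarray(c, dtype=float)
    padded = np.vstack([c[:1], c, c[-1:]])   # repeat edge frames
    d = padded[2:] - padded[:-2]             # c_{t+1} - c_{t-1}
    pd = np.vstack([d[:1], d, d[-1:]])
    dd = pd[2:] - pd[:-2]                    # second-order difference
    return np.hstack([c, d, dd])             # 13 + 13 + 13 = 39 features
```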
Explanatory Example
(Figure: speech waveform of the phoneme /ae/; the same frame after pre-emphasis and Hamming windowing; its power spectrum; and the resulting MFCCs.)
TRAINING
It is the process of learning the AM, Dictionary and LM.

Acoustic Model (AM):
The AM provides a mapping between a unit of speech and an HMM that can be scored against incoming features provided by the Front-End. It contains a pool of Hidden Markov Models (HMMs).
For large vocabularies each word is represented as a sequence of phonemes; accordingly, there has to be an AM per phoneme. Moreover, it has to depend on the context (e.g. co-articulation), and the context dependence may even cross word boundaries. Phones are therefore further refined into context-dependent triphones, i.e., phones occurring in given left and right phonetic contexts.

HMMs
An HMM is defined by the model parameters λ = (A, B, π).
For each acoustic segment, there is a probability distribution across acoustic observations, bi(k).
The leading technique is to represent the acoustic observations as a mixture Gaussian distribution, or shortly Gaussian Mixtures (GM).
(Diagram: a left-to-right HMM with states S0–S3, self-transition probabilities a00, a11, a22, and emission distributions b0(k), b1(k), b2(k).)
Dictionary:
The dictionary is a file containing pronunciations for all the words of interest to the decoder.
For large-vocabulary speech recognizers, pronunciations are specified as a linear sequence of phonemes.

Some digit pronunciations:
ZERO            Z IH R O
EIGHT           EY TD
Multiple pronunciations:
ACTUALLY        AE K CH AX W AX L IY
ACTUALLY(2nd)   AE K SH AX L IY
ACTUALLY(3rd)   AE K SH L IY
Compound words:
WANT_TO         W AA N AX
Language Model (LM):
It is a statistical LM, where the speaker could be talking about any arbitrary topic.
The main model used is n-gram statistics, in particular the trigram (n = 3): P(Wt | Wt−1, Wt−2).
Bigram and unigram LMs are employed as well.
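A maximum-likelihood trigram estimate of P(Wt | Wt−1, Wt−2) can be sketched as below; real recognizers add smoothing and backoff to the bigram and unigram models, which this minimal version omits:

```python
from collections import Counter

def train_trigram(corpus):
    """Estimate P(w | h1, h2) = count(h1, h2, w) / count(h1, h2)
    from a list of tokenized sentences (maximum likelihood, no backoff)."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]   # sentence boundary padding
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2 : i + 1])] += 1
            bi[tuple(toks[i - 2 : i])] += 1
    def prob(w, h1, h2):
        denom = bi[(h1, h2)]
        return tri[(h1, h2, w)] / denom if denom else 0.0
    return prob
```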
RECOGNITION
Given an input speech utterance, the goal is to UNVEIL the BEST hidden state sequence.
Let S = (s1, s2, …, sT) be the sequence of states that are recognized and xt be the feature samples computed at time t, where the feature sequence from time 1 to t is denoted X = (x1, x2, …, xt).
Accordingly, the sequence of recognized states S* is obtained by:
S* = ArgMax P(S, X | λ)
(Diagram: the search algorithm combines a static structure (the knowledge networks) with a dynamic structure, evaluating P(xt, {st} | {st−1}, λ) frame by frame over the active states {St−1} to produce S*.)
The Viterbi Beam search

Initialization:
  For 1 ≤ i ≤ N: V1(i) = πi·bi(X1)
  Goto XX
Recursive step:
  For 2 ≤ t ≤ T {
    For 1 ≤ k ≤ N: Vt(k) = Max over j of [Vt−1(j)·ajk]·bk(Xt)
    Goto XX
  }
Backtracking:
  s*T = ArgMax over i of VT(i)
  s*t = the memorized predecessor of s*t+1,   t = T−1, T−2, …, 1
  S* = (s*1, s*2, …, s*T) is the best sequence
XX (beam pruning):
  Find pt(s*t) = Max over 1 ≤ i ≤ N of [Vt(i)]
  Calculate the threshold b·pt(s*t)
  For 1 ≤ j ≤ N {
    If pt(st = j) ≥ b·pt(s*t): MEMORIZE both Vt(j) and path "j"
    Else: DISCARD Vt(j)
  }
  Return
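The search can be sketched for a small discrete-emission HMM; the beam factor and the discrete emission table B are simplifying assumptions here (a real LVCSR decoder scores Gaussian mixtures over the MFCC features):

```python
import numpy as np

def viterbi_beam(pi, A, B, obs, beam=1e-3):
    """Viterbi search with beam pruning over an N-state HMM.

    pi   : initial state probabilities, shape (N,)
    A    : transition matrix a[j, k], shape (N, N)
    B    : discrete emission table b[k, symbol] (a simplification)
    beam : states scoring below beam * (best score) are discarded

    Returns the best state sequence s*_1..s*_T (slide notation).
    """
    N, T = len(pi), len(obs)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):                       # recursive step
        scores = V[t - 1][:, None] * A          # V_{t-1}(j) * a_{jk}
        back[t] = scores.argmax(axis=0)         # memorize best path "j"
        V[t] = scores.max(axis=0) * B[:, obs[t]]
        V[t][V[t] < beam * V[t].max()] = 0.0    # beam pruning
    # backtracking: s*_T = argmax V_T, then follow memorized predecessors
    s = np.zeros(T, dtype=int)
    s[-1] = V[-1].argmax()
    for t in range(T - 2, -1, -1):
        s[t] = back[t + 1][s[t + 1]]
    return s
```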
SIGN LANGUAGE
Sign Language is a communication system using gestures that are interpreted visually.
As a whole, sign languages share the same modality, a sign, but they differ from country to country.
AMERICAN SIGN LANGUAGE (ASL)
ASL is the dominant sign language in the US, Anglophone Canada and parts of Mexico.
Currently, approximately 450,000 deaf people in the United States use ASL as their primary language.
ASL signs follow a certain order, just as words do in spoken English. However, in ASL one sign can express a meaning that would necessitate the use of several words in speech.
The grammar of ASL uses spatial locations, motion, and context to indicate syntax.
ASL ALPHABET
It is a manual alphabet representing all the letters of the English alphabet, using only the hands.
Making words using a manual alphabet is called fingerspelling.
Manual alphabets are a part of sign languages.
For ASL, the one-handed manual alphabet is used.
Fingerspelling is used to complement the vocabulary of ASL when spelling individual letters of a word is the preferred or only option, such as with proper names or the titles of works.
(Chart: the ASL one-handed manual alphabet, Aa through Zz.)
SIGNED ENGLISH (SE):
SE is a reasonable manual parallel to English.
The idea behind SE and other signing systems parallel to English is that deaf people will learn English better if they are exposed, visually through signs, to the grammatical features of English.
SE uses two kinds of gestures: sign words and sign markers.
Each sign word stands for a separate entry in a standard English dictionary.
The sign words are signed in the same order as words appear in an English sentence. Sign words are presented in singular, non-past form.
Sign markers are added to these basic signs to show, for example, that you are talking about more than one thing or that something has happened in the past.
When this does not represent the word in mind, the manual alphabet can be used to fingerspell the word.
Most signs in SE are taken from American Sign Language, but these signs are now used in the same order as English words and with the same meaning.
ASL vs. SE (an Example)
Sentence: "It is alright if you have a lot"
(Side-by-side video frames: the ASL Translation and the SE Translation of "IT IS ALL RIGHT / IF YOU HAVE A LOT".)
ASL Translation flow:
1. Take the recognized word (the SR engine's output); in case of a non-basic word, extract the basic word out of it.
2. Check whether the basic word is within the ASL database vocabulary (2,600 ASL pre-recorded video clips).
3. Yes: the final output is the equivalent ASL video clip of the input word; only in case of a non-basic input word, some suitable marker is appended.
4. No (none of the database contents matched the input basic word): the final output is fingerspelling of the original input word using the American Manual Alphabet.
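The routing logic above can be sketched as a single function; asl_clips and extract_basic are hypothetical stand-ins for the 2,600-clip database and the basic-word extraction step:

```python
def translate_word(word, asl_clips, extract_basic):
    """Route one recognized word through the ASL translation flow.

    asl_clips     : dict mapping basic words to pre-recorded clip files
                    (stand-in for the 2,600-clip ASL database)
    extract_basic : function reducing a non-basic word to its basic form
                    (a hypothetical stemmer, e.g. "cats" -> "cat")

    Returns ("clip", file, marker) when the database holds the word,
    or ("fingerspell", letters) to spell it with the manual alphabet.
    """
    basic = extract_basic(word)
    # for a non-basic word, the removed suffix acts as the appended marker
    marker = None if basic == word else word[len(basic):]
    if basic in asl_clips:
        return ("clip", asl_clips[basic], marker)
    # no database match: fingerspell the original input word
    return ("fingerspell", list(word.upper()))
```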
DEMONSTRATION OF THE ASL IN OUR SW:
Speech to Sign Language Interpreter System - MILESTONE

Thesis writing, outline & progress (% drafted):
- Chapter 1: Introduction
- Chapter 2: State-of-the-Art of SR
- Chapter 3: Sphinx SR
- Chapter 4: Sphinx Decoder
- Chapter 5: Sign Language
- Chapter 6: SW Demo, Conclusions & Further Work
- Appendices

SW development & progress (% completed):
- SR Engine
- ASL Database
- Overall Integrated SW
Thank You
Your Questions Are Most Welcome