By: Khalid El-Darymli (G0327887)
Supervisor: Dr. Othman O. Khalifa
International Islamic University Malaysia
Kulliyyah of Engineering, ECE Dept.
Speech to Sign Language Interpreter System
OUTLINE
- Problem statement.
- Research goal and objectives.
- Main parts of our system.
- The structure of ASR:
  – SP,
  – Training: AM, Dictionary and LM,
  – and Decoding: the Viterbi beam search.
- Sign Language, ASL and the ASL alphabet.
- Signed English.
- Demo of ASL in our SW.
- Milestone.
Problem Statement
There is no free software, let alone one with a reasonable price, to convert speech into sign language in live mode.
Only one software product is commercially available to convert uttered speech in live mode to a video sign language.
This software is called iCommunicator, and to purchase it a deaf person has to pay USD 6,499!
IS IT FAIR?
RESEARCH GOAL AND OBJECTIVES
- Design and development of a Speech to Sign Language Interpreter System.
- The SW is open source and freely available, which will benefit the deaf community.
- To fill the gap between deaf and non-deaf people in two senses: first, by using this SW for educational purposes for deaf people, and second, by facilitating communication between deaf and non-deaf people.
- To increase the independence and self-confidence of the deaf person.
- To increase opportunities for advancement and success in education, employment, personal relationships, and public access venues.
- To improve quality of life.
Main Parts of the Speech to Sign Language Interpreter System

Continuous input speech → Speech-Recognition Engine → Recognized text → ASL Translation, backed by a database of pre-recorded ASL video clips.
Automatic Speech Recognition (ASR):
SR systems are clustered according to three categories: isolated vs. continuous, speaker-dependent vs. speaker-independent, and small vs. large vocabulary.
The expected task of our software entails a large-vocabulary, speaker-independent, continuous speech recognizer.

Input voice → SR Engine → Recognized text.
The Structure of the SR Engine (LVCSR)

Input audio → Signal Processing → feature vectors X = {x1, x2, …, xT}.

TRAINING produces three knowledge sources:
- AM: P(A1, …, AT | P1, …, Pk)
- Dictionary: P(P1, P2, …, Pk | W)
- LM: P(Wn | W1, …, Wn−1)

DECODING: for the hypotheses H = {W1, W2, …, Wk}, the decoder performs hypothesis evaluation by scoring P(X | W)·P(W) and returns the best hypothesis WBEST.
SIGNAL PROCESSING (FRONT-END):

Speech waveform x[n] (16-bit integer data) → Pre-emphasis → y[n] → Framing → yt'[n] → Windowing → yt[n] → Power Spectrum Calculation → St[k] → Mel Filterbank, ln| |² → St[m] → IDFT → 13 ct[n]

Pre-emphasis:
y[n] = x[n] − α·x[n−1]
where α is the pre-emphasis parameter.
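The pre-emphasis filter above can be sketched as follows; the default α = 0.97 is a common choice in the literature, not a value stated on the slide:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1].

    alpha is the pre-emphasis parameter; 0.97 is an assumed default,
    not a value given on the slide.
    """
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]   # first-order high-pass difference
    return y
```

This boosts the high-frequency content of the waveform before framing.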
MFCC computation:
The MFCC is a representation defined as the real cepstrum of a windowed short-time signal derived from the FFT of that signal.
MFCC computation consists of performing the inverse DFT on the logarithm of the magnitude of the filterbank output:

ct[n] = Σ (m = 0 … M−1) ln{St[m]} · cos(π·n·(m + 1/2)/M),   n = 0, 1, …, M−1

TYPICALLY FOR SPEECH RECOGNITION ONLY THE FIRST 13 COEFFICIENTS ARE USED.
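The cepstral step can be sketched directly from the formula; the function below assumes the mel filterbank outputs St[m] are given and keeps the first 13 coefficients:

```python
import numpy as np

def mfcc_from_mel(mel_spectrum, n_ceps=13):
    """Compute c[n] = sum_{m=0}^{M-1} ln(S[m]) * cos(pi*n*(m + 1/2)/M)
    and keep the first n_ceps coefficients (13 for speech recognition)."""
    S = np.asarray(mel_spectrum, dtype=float)
    M = len(S)
    n = np.arange(M)[:, None]        # cepstral index
    m = np.arange(M)[None, :]        # filterbank index
    basis = np.cos(np.pi * n * (m + 0.5) / M)
    c = basis @ np.log(S)            # inverse DFT of the log mel spectrum
    return c[:n_ceps]
```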
Framing and Windowing
Typical frame duration in speech recognition is 10 ms, while typical window duration is 25 ms.

Frames are extracted with frame shift Q:
yt'[n] = y[Q·t + n],   n = 0, …, N−1,  t = 1, …, T

Each frame is weighted by a Hamming window:
w[n] = 0.54 − 0.46·cos(2π·n/(N−1)),   n = 0, …, N−1;  0 otherwise

yt[n] = w[n]·yt'[n]
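A minimal sketch of the framing and windowing equations above, assuming the frame length N and frame shift Q are given in samples:

```python
import numpy as np

def frame_and_window(y, frame_len, frame_shift):
    """Split y into overlapping frames y_t'[n] = y[Q*t + n] and apply the
    Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame.

    frame_len (N) corresponds to ~25 ms and frame_shift (Q) to ~10 ms
    at the chosen sampling rate.
    """
    y = np.asarray(y, dtype=float)
    N, Q = frame_len, frame_shift
    T = 1 + (len(y) - N) // Q                       # number of full frames
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
    frames = np.stack([y[t * Q : t * Q + N] for t in range(T)])
    return frames * w                               # y_t[n] = w[n] * y_t'[n]
```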
The mel filterbank:
It is used to extract spectral features of speech by properly integrating the spectrum over defined frequency ranges. The transfer function of the triangular mel-weighting filters Hm[k] is given by:

Hm[k] =
  0                                                   for k < f[m−1]
  2(k − f[m−1]) / ((f[m+1] − f[m−1])(f[m] − f[m−1]))  for f[m−1] ≤ k ≤ f[m]
  2(f[m+1] − k) / ((f[m+1] − f[m−1])(f[m+1] − f[m]))  for f[m] ≤ k ≤ f[m+1]
  0                                                   for k > f[m+1]

where f[m] are the boundary points of the filters.

The mel-spectrum of the power spectrum is computed by:
St[m] = Σ (k = 0 … N−1) Hm[k]·St[k],   m = 0, 1, …, M−1

where k is the DFT domain index, N is the length of the DFT, and M is the total number of triangular mel-weighting filters.
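The triangular filterbank can be sketched from the piecewise definition above; the boundary bins f[0..M+1] are assumed here to be precomputed DFT bin indices (the mel-scale spacing of those bins is omitted):

```python
import numpy as np

def triangular_filterbank(f, n_fft):
    """Build M triangular filters H_m[k] from boundary bins f[0..M+1],
    following the piecewise transfer function on the slide."""
    M = len(f) - 2
    H = np.zeros((M, n_fft))
    k = np.arange(n_fft)
    for m in range(1, M + 1):
        left, center, right = f[m - 1], f[m], f[m + 1]
        rising = (k >= left) & (k <= center)
        falling = (k > center) & (k <= right)
        denom = right - left
        H[m - 1, rising] = 2 * (k[rising] - left) / (denom * (center - left))
        H[m - 1, falling] = 2 * (right - k[falling]) / (denom * (right - center))
    return H

def mel_spectrum(S_t, H):
    """S_t[m] = sum_k H[m, k] * S_t[k]."""
    return H @ S_t
```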
Power Spectrum
The STFT is calculated using:
Yt(e^jω) = Σn y[Q·t + n]·w[n]·e^(−jωn)

To reduce computational complexity, Yt(e^jω) is evaluated only for a discrete number of ω values, ω = 2πk/N; the DFT of all frames of the signal is then obtained:
Yt[k] = Yt(e^(j2πk/N)),   k = 0, …, N−1

The phase information of the DFT samples of each frame is discarded.
The final output of this stage is:
St[k] = (real(Yt[k]))² + (imag(Yt[k]))²
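This stage reduces to one FFT per windowed frame with the phase discarded; a minimal sketch:

```python
import numpy as np

def power_spectrum(frames, n_fft):
    """Compute S_t[k] = real(Y_t[k])^2 + imag(Y_t[k])^2 per frame,
    discarding the DFT phase information."""
    Y = np.fft.fft(frames, n=n_fft, axis=-1)   # Y_t[k], k = 0..N-1
    return Y.real**2 + Y.imag**2               # = |Y_t[k]|^2
```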
Delta and Double-Delta computation
First- and second-order differences may be used to capture the dynamic evolution of the signal.
The first-order delta MFCC is computed from:
Δct = ct+1 − ct−1
The second-order delta MFCC is computed from:
ΔΔct = Δct+1 − Δct−1
The final output of the FE processing comprises a 39-feature vector (observation vector Xt) per processed frame.
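A sketch of the delta and double-delta differences; the handling of edge frames is not specified on the slides, so repeating the boundary frame is an assumption here:

```python
import numpy as np

def add_deltas(c):
    """Given MFCC frames c (shape T x 13), append first-order deltas
    delta_c_t = c_{t+1} - c_{t-1} and second-order deltas
    delta2_c_t = delta_c_{t+1} - delta_c_{t-1}, yielding T x 39 features.
    Boundary frames are repeated at the edges (an assumption)."""
    c = np.asarray(c, dtype=float)
    padded = np.vstack([c[:1], c, c[-1:]])   # repeat edge frames
    d = padded[2:] - padded[:-2]             # c_{t+1} - c_{t-1}
    pd = np.vstack([d[:1], d, d[-1:]])
    dd = pd[2:] - pd[:-2]                    # second-order difference
    return np.hstack([c, d, dd])             # 13 + 13 + 13 = 39 features
```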
Explanatory Example
(Figure: speech waveform of the phoneme /ae/; the same frame after pre-emphasis and Hamming windowing; its power spectrum; and the resulting MFCCs.)
TRAINING
It is the process of learning the AM, Dictionary and LM.

Acoustic Model (AM):
The AM provides a mapping between a unit of speech and an HMM that can be scored against incoming features provided by the Front-End. It contains a pool of Hidden Markov Models (HMMs).
For large vocabularies each word is represented as a sequence of phonemes; accordingly, there has to be an AM per phoneme. Moreover, it has to depend on the context (e.g. co-articulation), and the context dependence may even cross word boundaries. Phones are therefore further refined into context-dependent triphones, i.e., phones occurring in given left and right phonetic contexts.

HMMs
An HMM is defined by the model parameters λ = (A, B, π).
For each acoustic segment, there is a probability distribution across acoustic observations, bi(k).
The leading technique is to represent the acoustic observations as a mixture Gaussian distribution, or shortly Gaussian Mixtures (GM).
(Diagram: a left-to-right HMM with states S0–S3, self-transition probabilities a00, a11, a22, and emission distributions b0(k), b1(k), b2(k).)
Dictionary:
The dictionary is a file containing pronunciations for all the words of interest to the decoder.
For large-vocabulary speech recognizers, pronunciations are specified as a linear sequence of phonemes.

Some digit pronunciations:
ZERO            Z IH R O
EIGHT           EY TD
Multiple pronunciations:
ACTUALLY        AE K CH AX W AX L IY
ACTUALLY(2nd)   AE K SH AX L IY
ACTUALLY(3rd)   AE K SH L IY
Compound words:
WANT_TO         W AA N AX
Language Model (LM):
It is a statistical LM, where the speaker could be talking about any arbitrary topic.
The main model used is n-gram statistics, in particular the trigram (n = 3): P(Wt | Wt−1, Wt−2).
Bigram and unigram LMs are employed as well.
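A maximum-likelihood trigram estimate of P(Wt | Wt−1, Wt−2) can be sketched as below; real recognizers add smoothing and backoff to the bigram and unigram models, which this minimal version omits:

```python
from collections import Counter

def train_trigram(corpus):
    """Estimate P(w | h1, h2) = count(h1, h2, w) / count(h1, h2)
    from a list of tokenized sentences (maximum likelihood, no backoff)."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]   # sentence boundary padding
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2 : i + 1])] += 1
            bi[tuple(toks[i - 2 : i])] += 1
    def prob(w, h1, h2):
        denom = bi[(h1, h2)]
        return tri[(h1, h2, w)] / denom if denom else 0.0
    return prob
```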
RECOGNITION
Given an input speech utterance, the goal is to UNVEIL the BEST hidden state sequence.
Let S = (s1, s2, …, sT) be the sequence of states that are recognized and xt be the feature samples computed at time t, where the feature sequence from time 1 to t is denoted X = (x1, x2, …, xt).
Accordingly, the sequence of recognized states S* is obtained by:
S* = ArgMax P(S, X | λ)
(Diagram: the search algorithm combines a static structure (the knowledge networks) with a dynamic structure, evaluating P(xt, {st} | {st−1}, λ) frame by frame over the active states {St−1} to produce S*.)
The Viterbi Beam search

Initialization:
  For 1 ≤ i ≤ N: V1(i) = πi·bi(X1)
  Goto XX
Recursive step:
  For 2 ≤ t ≤ T {
    For 1 ≤ k ≤ N: Vt(k) = Max over j of [Vt−1(j)·ajk]·bk(Xt)
    Goto XX
  }
Backtracking:
  s*T = ArgMax over i of VT(i)
  s*t = the memorized predecessor of s*t+1,   t = T−1, T−2, …, 1
  S* = (s*1, s*2, …, s*T) is the best sequence
XX (beam pruning):
  Find pt(s*t) = Max over 1 ≤ i ≤ N of [Vt(i)]
  Calculate the threshold b·pt(s*t)
  For 1 ≤ j ≤ N {
    If pt(st = j) ≥ b·pt(s*t): MEMORIZE both Vt(j) and path "j"
    Else: DISCARD Vt(j)
  }
  Return
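The search can be sketched for a small discrete-emission HMM; the beam factor and the discrete emission table B are simplifying assumptions here (a real LVCSR decoder scores Gaussian mixtures over the MFCC features):

```python
import numpy as np

def viterbi_beam(pi, A, B, obs, beam=1e-3):
    """Viterbi search with beam pruning over an N-state HMM.

    pi   : initial state probabilities, shape (N,)
    A    : transition matrix a[j, k], shape (N, N)
    B    : discrete emission table b[k, symbol] (a simplification)
    beam : states scoring below beam * (best score) are discarded

    Returns the best state sequence s*_1..s*_T (slide notation).
    """
    N, T = len(pi), len(obs)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):                       # recursive step
        scores = V[t - 1][:, None] * A          # V_{t-1}(j) * a_{jk}
        back[t] = scores.argmax(axis=0)         # memorize best path "j"
        V[t] = scores.max(axis=0) * B[:, obs[t]]
        V[t][V[t] < beam * V[t].max()] = 0.0    # beam pruning
    # backtracking: s*_T = argmax V_T, then follow memorized predecessors
    s = np.zeros(T, dtype=int)
    s[-1] = V[-1].argmax()
    for t in range(T - 2, -1, -1):
        s[t] = back[t + 1][s[t + 1]]
    return s
```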
SIGN LANGUAGE
Sign Language is a communication system using gestures that are interpreted visually.
As a whole, sign languages share the same modality, a sign, but they differ from country to country.
AMERICAN SIGN LANGUAGE (ASL)
ASL is the dominant sign language in the US, Anglophone Canada and parts of Mexico.
Currently, approximately 450,000 deaf people in the United States use ASL as their primary language.
ASL signs follow a certain order, just as words do in spoken English. However, in ASL one sign can express a meaning that would necessitate the use of several words in speech.
The grammar of ASL uses spatial locations, motion, and context to indicate syntax.
ASL ALPHABET
It is a manual alphabet representing all the letters of the English alphabet, using only the hands.
Making words using a manual alphabet is called fingerspelling.
Manual alphabets are a part of sign languages.
For ASL, the one-handed manual alphabet is used.
Fingerspelling is used to complement the vocabulary of ASL when spelling individual letters of a word is the preferred or only option, such as with proper names or the titles of works.
(Chart: the ASL one-handed manual alphabet, Aa through Zz.)
SIGNED ENGLISH (SE):
SE is a reasonable manual parallel to English.
The idea behind SE and other signing systems parallel to English is that deaf people will learn English better if they are exposed, visually through signs, to the grammatical features of English.
SE uses two kinds of gestures: sign words and sign markers.
Each sign word stands for a separate entry in a standard English dictionary.
The sign words are signed in the same order as words appear in an English sentence. Sign words are presented in singular, non-past form.
Sign markers are added to these basic signs to show, for example, that you are talking about more than one thing or that something has happened in the past.
When this does not represent the word in mind, the manual alphabet can be used to fingerspell the word.
Most signs in SE are taken from American Sign Language, but these signs are now used in the same order as English words and with the same meaning.
ASL vs. SE (an Example)
Sentence: "It is alright if you have a lot"
(Side-by-side video frames: the ASL Translation and the SE Translation of "IT IS ALL RIGHT / IF YOU HAVE A LOT".)
ASL Translation flow:
1. Take the recognized word (the SR engine's output); in case of a non-basic word, extract the basic word out of it.
2. Check whether the basic word is within the ASL database vocabulary (2,600 ASL pre-recorded video clips).
3. Yes: the final output is the equivalent ASL video clip of the input word; only in case of a non-basic input word, some suitable marker is appended.
4. No (none of the database contents matched the input basic word): the final output is fingerspelling of the original input word using the American Manual Alphabet.
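The routing logic above can be sketched as a single function; asl_clips and extract_basic are hypothetical stand-ins for the 2,600-clip database and the basic-word extraction step:

```python
def translate_word(word, asl_clips, extract_basic):
    """Route one recognized word through the ASL translation flow.

    asl_clips     : dict mapping basic words to pre-recorded clip files
                    (stand-in for the 2,600-clip ASL database)
    extract_basic : function reducing a non-basic word to its basic form
                    (a hypothetical stemmer, e.g. "cats" -> "cat")

    Returns ("clip", file, marker) when the database holds the word,
    or ("fingerspell", letters) to spell it with the manual alphabet.
    """
    basic = extract_basic(word)
    # for a non-basic word, the removed suffix acts as the appended marker
    marker = None if basic == word else word[len(basic):]
    if basic in asl_clips:
        return ("clip", asl_clips[basic], marker)
    # no database match: fingerspell the original input word
    return ("fingerspell", list(word.upper()))
```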
DEMONSTRATION OF THE ASL IN OUR SW:
Speech to Sign Language Interpreter System - MILESTONE

Thesis writing, outline & progress (% drafted):
- Chapter 1: Introduction
- Chapter 2: State-of-the-Art of SR
- Chapter 3: Sphinx SR
- Chapter 4: Sphinx Decoder
- Chapter 5: Sign Language
- Chapter 6: SW Demo, Conclusions & Further Work
- Appendices

SW development & progress (% completed):
- SR Engine
- ASL Database
- Overall Integrated SW
Thank You
Your Questions Are Most Welcome