Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Real Time Characterization of Lip MovementsUsing Webcam
Suresh D S, Sandesh C
Director & Head of ECE Dept, CIT, Gubbi, Tumkur, Karnataka.
PG Student, ECE Dept, CIT, Gubbi, Tumkur, Karnataka.
Abstract: Communication plays an important role in the day to day activities of human being. The mainobjective of this paper is to help the people who are unable to speak. Visual speech plays a major role in thelip reading for listeners with deaf person. Here we are using local spatiotemporal descriptors in order toidentify or recognize the words or phrases from the disable (dumb) people by their lip movements. Localbinary patterns extracted from the lip movements are used to recognize the isolated phrases. Experimentswere made on ten speakers with 5 phrases and obtained accuracy of about 75% for speaker dependent and65% for speaker independent. While being made comparison with other concepts like AV letter database,our method outperforms the other by accuracy of 65%. The advantages of our method are robustness andrecognition is possible in real time.
Key words: LBP Top, Webcam, Visual speech, KNN classifier, Silent speech interface.
I. Introduction
In recent days, a silent speech interface has becoming an alternative in the field of speech processing. Avoice based speech communication has suffered from many challenges which arises due to fact that speechshould be clearly audible, it cannot be masked, includes lack of robustness privacy issues and exclusion ofspeech disabled person. So these challenges may be overcome by the use of silent speech interface. Silentspeech interface is a device, where system enabling speech communication takes place, without thenecessity of a voice signal or when audible acoustic signal is unavailable. In our approach, lip movementsare captured by the use of a webcam, which is placed in front of the lips. Many research work focuses onlyon visual information to enhance speech recognition[1]. Audio information still plays a major role than thevisual feature or information. But, in most cases it is very difficult to extract information. In our method weare concentrating on the lip movement representations for speech recognition solely with visualinformation.
Extraction of set of visual observation vectors is the key element on AVSR(audio visual speech recognition)system. Geometric feature, combined feature and appearance features are mainly used for representingvisual information. Geometric feature method represents the facial animation parameters such as lipmovement, shape of jaw and width of the mouth. These methods require more accurate and reliable facialfeature detection, which are difficult in practice and impossible at low image resolution.
In this paper, we propose an approach for lip reading or lip movements, where human computerinteraction improves significantly and understanding in noisy environment also improves.
II. Local Spatiotemporal Descriptors for Visual Information
In this paper, we are concentrating on LBP TOP in order to extract the features from the extracted videoframes. The Local Binary Pattern (LBP) operator is a gray scale which is not varied, always remains thesame texture, simple statistic, which has shown the best performance in the classification of different kindsof texture. For an individual pixel in an image its respective binary will be generated and the thresholds willbe compared with the its neighborhood value of center pixel as shown in Figure 1(a) [1].
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 34
NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in
Downloaded fro
m w
ww.e
dlib
.asd
f.re
s.in
Where ggrey valuthe occuneighbortexture p
Local textheir robrecognitiTOP mefeaturesand T. Fpoints inPYT. Thethe centgiven bygXT,p aneighborthree axewe use L
Figure.
gc denotes thues of P equaurrences of vrhoods with aprimitives or
xture descripbustness to cion using locthod is morein two dimen
For LBP TOPn the XY, XT,e LBP TOP feer pixel gtc,e(xc RXsin(2
are given byrhood in YTes are the samBP TOPP,R f
2. (a) Volum
Figure. 1. (a)
he grey valueally spaced pivarious binaany numbermicro patter
ptors have obhallenge succal binary pae efficient thnsion, where, the radii in, and YT planfeature is thee are (xc, yc,p/PXY),yc+R
y (xc RXsinplane gYT,p
me and so dofor abbreviati
me of utteranc
Basic LBP op
of center pixixels on a cirary patterns.of pixels as srns, like spots
btained tremh as pose anatterns were ehan the ordinas in LBP TO
n spatial andnes can also bn denoted as, tc), than thRY cos(2 p/P(2 p/PXT),yc(xc, yc RYcothe numberion where P=
ce sequence.
perator. (b) C
xel{xc, yc} ofrcle of radiusThe definit
shown in Figs, lines and co
mendous attennd illuminatioextracted fronary LBP. InOP we are extemporal axbe different as LBP TOP Phe coordinatePXY), tc), thec, tc –RTos(2 p/PYT),of neighbori
=PXY=PXT=PY
(b) Image inplane.
Circular (8,2)
f the local nes R. A histogrtion of neighgure 1(b). Byorners.
ntion in theon changes.om the threeordinary LB
xtracting infoxes X, Y, andand can be mPXY, PXT, PYes of local nee coordinatescos(2 p/PXT,, tc –RT sin(ng points in XYT and R=RX
XY plane (c)
neighborhoo
ighborhood aram is createhbors can bthis way, we
analysis of fIn our approorthogonal p
BP we are extormation in thd T, and themarked as RXYT, RX, RY, Reighborhoods of local neiT)) and the(2 p/PYT)). SXY, XT, and YX=RY=RT[1].
Image in XT
od.
and gp is reld in order toe extendedcan collect l
facial imageoach a tempoplanes(LBP Ttracting infohree dimensinumber of n
X, RY and RT,RT. If the cooin XY plane
ighborhood ine coordinateSometimes, tYT planes. In
T plane(d) Im
ated to theo collect upto circularlarger scale
because oforal textureTOP). LBPormation orion i.e, X, YneighboringPXY, PXT,
ordinates ofe gXY,p aren XT planes of localthe radii inn that case ,
mage in TY
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 35
NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in
Downloaded fro
m w
ww.e
dlib
.asd
f.re
s.in
Figure 2Figure 2(Figure 2(over theindicatiodividingof the LBcode ofcorrespo
Figure 3.LBP YT i
Figure. 4(c) Conca
When a“see you”
(a) demonstr(c) is an ima(d) describese whole utteron about thethe mouth imBP images. Tevery pixel
onding to mou
. Mouth regioimages (last r
4. Features inatenated feat
person utter”. If we do no
rates the volage in the XTs the motionrance sequeneir locations.mage into sevThe second, tfrom XY (seuth images in
on images (frow) from on
each block vtures for one
Fig
rs a commanot consider th
lume of utteT plane proviof one columnce it encod. In order toveral overlapthird, and fouecond row),n the first row
first row), LBne utterance[1
volume. (a) Bblock volum
gure. 5. Mouth
nd phrase, thhe time order
erance sequending a visualmn in tempores only theo overcomepping blocks iurth rows shXT (third ro
w.
BP XY images1].
Block volumesme with the ap
h movement
he words arer, these two p
nce. Figure 2l impressionral space[1]. Aoccurrencesthis effect, ais introducedhow the LBPow), and YT
s (second row
s. (b) LBP feappearance and
representati
pronouncedphrases would
2(b) shows iof one row cAn LBP descrof the microa representad. Figure. 3 alimages whic
T (fourth row
w), LBP XT i
atures from td motion[1].
on[1].
d in order, fod generate al
image in thechanging in tription wheno patterns wation whichlso gives somch are drawnw) planes, re
images (third
three orthogo
or instance “ymost the sam
e XY plane.time, whilen computedwithout anyconsists of
me examplesn using LBPespectively,
d row), and
onal planes.
you see” orme features.
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 36
NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in
Downloaded fro
m w
ww.e
dlib
.asd
f.re
s.in
To overcregions bvolumeextractedregion se
A histogr
Multi resbe increaWhen thvector wthe multlocation,vertical mparameteneighbor(verticaltime scal
Our systsecond sto recogn
come this effbut also in tare computed from eachequence, as sh
ram of the m
solution featuased. By usinhe features frould be veryti resolutionwith what rmotion thaters, three diffring points wmotion) slicles; 3) Use of
em consists otage extractsnize the inpu
fect, the whotime order, aed and concblock volumhown in Figu
mouth movem
III. Multi
ures will provg these multirom the diffelong and as afeatures willresolution anare very im
fferent types owhen compuces; 2) Use oblocks of diff
of three stages the visual feut utterance u
ole sequenceas shown incatenated intme are conneure. 5.
ments can be d
iresolution
vide more acci resolution frent resolutia result the cnot contribu
nd more impmportant. Weof spatiotemputing the feaf different rafferent sizes t
IV
es, as showneatures fromusing an KNN
Figur
is not only dFigure. 4(a)to a single hcted to repre
defined as fol
n Features a
curate informfeatures, it heon were madcomputationaute equally, sortantly thee need featuporal resolutitures in XYadii that cano create glob
V. Our Syst
in Figure 6. Tthe mouth m
N classifier.
re 6. System d
divided intoshows. Thehistogram, aesent the app
llows
and Feature
mation and alselps in increade to be connal complexityso it is necestypes such a
ure selectionion are prese(appearancecapture the
bal and local s
tem
The first stagmovement seq
diagram
block volumLBP TOP hiss shown inpearance and
e Selection
so the analysasing the numnected in sery will be to hissary to findas appearancefor this purnted: 1) Use oe), XT (horizoccurrences
statistical fea
ge is a detectiquence. The
mes accordingstograms inFigure. 4. Ad motion of
n
sis of dynamicmber of featuries directly,igh. It is obviout featurese, horizontalrpose. In chof a differentzontal motions in differenttures[4][5].
ion lip movemrole of the fi
g to spatialeach blockAll featuresthe mouth
c event willures greatly.the featureous that alllike which
l motion oranging thet number ofn), and YTt space and
ments. Thenal stage is
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 37
NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in
Downloaded fro
m w
ww.e
dlib
.asd
f.re
s.in
In our apfrom theclassifierto varioupoints, thused to acase, 1 Ntraining,to genera
For morincluding
1. Speakeutilized.with 10 dspeakerssequencetake intowhole seFigure. 7worst, mmouth re
2) Speakutilized fand after
3) One Oclassificayou”, “Yoof classifWhen th
Figure. 8(vertical
pproach webe mouth regr is selected sus object detehe phrase claccomplish rNN templatethe spatiotemate a histogra
e accurate eg speaker ind
er IndependeWe made a
different spea. The overales and N is tho account noequence is di7 demonstrat
mainly becauseegion.
ker Dependenfor cross valir testing is pe
One versus Oation problemou are welcomfiers grows qhe class numb
8. Selected 15motion) is se
bcam is usedgion. These fsince it is welection tasks ilassification precognition. Smatching ismporal LBP ham template
V
evaluation ofdependent, sp
ent Experimetraining of pakers and whl results werhe total numot only locatiivided into bte the perfore the big mou
Figure
nt Experimendation. In ouerformed with
One Rest Recom is decompome” “Have aquadraticallyber is N, the n
5 slices for pelected, and “
to capture tfeatures arell founded inin computerproblem is dSometimes ms applied tohistograms ofor that class
V. Experime
f our propospeaker depen
ents: For theparticular phren testing ware obtained umber of testinions of microblock volumermance for eustache of th
e. 7. Recogniti
nts: For speaur approach wh the same sp
ognition: In tosed into 45a good time”,with the numnumber of th
phrases “See y“_” the XT sli
he lip movemthen fed to
n statistical levision. Sinceecomposed i
more than onthese classe
of utterance ss[1].
ents Protoc
ed method,ndent, multi r
speaker indrase using onas done we wusing M/N (ng sequences)o patterns bues accordingevery speakerhat speaker re
ion performa
aker dependewhen a particpeaker 70% o
the previoustwo class proetc.). But us
mber of classhe KNN classi
you” and “Thce (horizonta
ments. Furththe classifie
earning theore the KNN isnto two class
ne class gets ts to reach thequences bel
ol and Resu
we made aresolution.
ependent exne speaker anwere successfu(M is the tot). When we aut also the tito not onlyr. The resulteally influence
ance for every
ent experimecular speakerof same phras
experimentsoblems (“Helsing this mulses to be recfiers would b
hank you”. “al motion), “/
her we will exer. For speecry and has beonly used fos problems, tthe highest nhe final resulonging to a g
ults
design with
periments, lend we madeul in gettingtal numberare extractinime order ofspatial regiots from the ses the appear
y speaker
ents, the leavr is trained wses were iden
s on our ownllo” “See youltiple two claognized likebe N (N 1)/2.
” in the blo/” means the
xtract the visch recognitioeen successfuor separatingthen a votingnumber of voult. This meagiven class ar
different ex
eave one speto say the sathe same phrof correctlyg the local pf lip movemens but also tsecond speakrance and mo
ve one utterawith some listtified correct
dataset, theu”, “I am sorrass strategy, tin AVLetter
cks means thappearance X
sual featureon, a KNNully appliedtwo sets ofg scheme isotes; in thisans that inre averaged
xperiments,
eaker out isame phraserase from 6recognizedatterns, weents, so thetime order.ker are theotion in the
ance out ist of phrasestly.
ten phrasery” “Thankthe numberrs database.
he YT sliceXY slice[1].
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 38
NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in
Downloaded fro
m w
ww.e
dlib
.asd
f.re
s.in
Figure 8most diffThe seleclast part.
Results f
A real timhelp theutteranceten speaapproach
1. LN
2. Tt
3. Ga9
4. Gs5
5. GeA
shows the sfficult to recocted slices ar. The selected
from One to
me capture odisable perse. LBP TOPakers and thh, our method
Lipreading WNovember 20T. Ahonen, Ato face recognG. Zhao andapplication t928, 2007.G. Zhao, M.spoken phras57–65.G. Zhao andexpression rAnalysis and
selected slicegnize becausre mainly ind features are
One and OnThe
of lip movemson. Our appis used to ex
he lip movemd outperform
With Local Sp009.A. Hadid, andnition,” IEEEd M. Pietikäio facial expr
Pietikäinen,ses,” in Proc.
d M. Pietikäecognition,”HCI, 2009, a
s for similarse they are quthe first ande consistent w
ne to Rest ClaParentheses
ments using wproach uses lxtract the feaments were cms the other b
atiotemporal
d M. PietikäinTrans. Pattenen, “Dynamressions,” IEE
and A. Hadid2nd Int. Wo
äinen, “BoosPattern Rec
accepted for p
phrases “seeuite similar insecond part
with the hum
Table I
assifiers on Seare From On
Conclusio
webcam for rlocal spatioteatures from thconverted inby accuracy o
Reference
l Descriptors
nen, “Face deern Anal. Macmic texture rEE Trans. Pat
d, “Local spaorkshop Hum
sted multi recognit. Lett.,publication.
e you” and “tn the latter pof the phras
man intuition.
emi Speakerne to Rest Str
on
recognition oemporal desche capturednto speech vof 70%.
es
IEEE Transa
escription witch. Intell., volrecognition uttern Anal. M
atiotemporalman Centered
esolution spa, Special Iss
thank you”. Tpart containinse; just one v
Dependent Erategy)
of speech wacriptors for thimages. Expevery efficient
actions On Mu
th local binarl. 28, no. 12, pusing local bMach. Intell.,
descriptors fd Multimedia
atiotemporalsue on Imag
These phraseng the same wvertical slice
Experiments
as proposed,he recognitioeriments werly. Compare
ultimedia, Vo
ry patterns: App. 2037–2041binary patternvol. 29, no.
for visual reca (HCM2007)
l descriptorsge/Video Bas
es were theword “you”.is from the
(Results in
in order toon of inputre made oned to other
ol. 11, No. 7,
Application1, 2006.ns with an6, pp. 915–
ognition of), 2007, pp.
s for facialed Pattern
Proceedings of The National Conference on Man Machine Interaction 2014 [NCMMI 2014] 39
NCMMI '14 ISBN : 978-81-925233-4-8 www.edlib.asdf.res.in
Downloaded fro
m w
ww.e
dlib
.asd
f.re
s.in