Upload
lillian-blake
View
219
Download
0
Embed Size (px)
Citation preview
The SIMILAR NoE Summer Workshop 2005
Combined Gesture-Speech Analysis and Synthesis
M. Emre Sargın, Engin Erzin, Yücel Yemez, A. Murat Tekalp{msargin,eerzin,yyemez,mtekalp}@ku.edu.tr
Multimedia Vision and Graphics Laboratory, Koc University
The SIMILAR NoE Summer Workshop 2005
Outline Project Objective Technical Description
Preparation of Gesture-Speech Database Detection of Gesture Elements Gesture-Speech Correlation Analysis Synthesis of Gestures Accompanying Speech
Resources Work Plan Team Members
The SIMILAR NoE Summer Workshop 2005
Project Objective The production of speech and gesture is interactive
throughout the entire communication process. Computer-Human Interaction systems should be
interactive such that, for an edutainment application, animated person’s speech should be aided and complemented by it’s gestures.
Two main goals of this project: Analysis and modeling of correlation between speech and
gestures. Synthesis of correlated natural gestures accompanying
speech.
The SIMILAR NoE Summer Workshop 2005
Technical Description Preparation of Gesture-Speech Database Detection of Gesture Elements Gesture-Speech Correlation Analysis Synthesis of Gestures Accompanying Speech
The SIMILAR NoE Summer Workshop 2005
Preparation of Database
Gestures of a specific person will be investigated. The video database related with that specific person
should include the gestures that he/she frequently uses. Locations of head, arm, elbows, etc. should easily be
detectable and traceable.
The SIMILAR NoE Summer Workshop 2005
Detection of Gesture Elements
In this project, we consider arm and head gestures. Main tasks included in detection of gesture elements:
Tracking of head region. Tracking of hand and possibly shoulder and elbow. Extraction of gesture features. Recognition and labeling of gestures.
The SIMILAR NoE Summer Workshop 2005
Head Region Tracking To extract motion information coming from head one
should first extract head region. Exhaustive search of head in each frame is a possible
solution. However this is computationally inefficient. Tracking is efficient by the means of computational
complexity. Motion information calculated for tracking will be used
for head gesture features.
The SIMILAR NoE Summer Workshop 2005
Tracking Methodology Exhaustive search for head region in initial frame
Haar-Based Face Detection Skin Color information
Extraction of motion information from head region Optical flow vectors Fitting global motion parameters optical flow vectors
Warp search window according to motion information. Search for head region in the search window.
The SIMILAR NoE Summer Workshop 2005
Head Tracking Results
The SIMILAR NoE Summer Workshop 2005
Hand Tracking Methodology Hand region will be extracted using skin color
information. Robust State-Space Tracking will be applied.
Observations are position of hand. States are position, speed and acceleration of hand. Kalman Filtering removes unwanted noise from features In Regular Kalman Filter, parameters are fixed. In Robust Kalman Filter parameters are re-adjusted for
each iteration to minimize MSE and overcome the effects of abrupt changes in motion of hand.
The SIMILAR NoE Summer Workshop 2005
Extraction of Gesture Features Head Gesture Features: Global Motion Parameters
calculated within head region will be used. Hand Gesture Features: Hand center of mass position
and calculated velocity will form hand gesture features.
The SIMILAR NoE Summer Workshop 2005
Gesture-Speech Correlation Analysis
Recognized gestures are labeled w.r.t. time. Head Gestures: Down, Up, Left, Right, Left-Right, … Arm Gestures: Abduction, Adduction, Extension, …
Recognized speech patterns are labeled w.r.t. time. Semantic Info: Approval, Refusal phrases, etc. Prosodic Info: Intonational phrases, ToBI transcriptions, etc.
Correlation Analysis via examining Co-occurrence Matrix Input/Output Hidden Markov Models
The SIMILAR NoE Summer Workshop 2005
Co-occurrence Matrix Estimation of joint probability distribution function, f(g,s) For each time sample give a vote to related gesture-
speech label pair. For a specific speech element the most correlated
gesture feature will be: gi=argmax ( f (gx,si) )
Relatively easy to compute. Gives an intuition about what we are examining.
x
The SIMILAR NoE Summer Workshop 2005
Input/Output Hidden Markov Models IOHMM is a graphical model which allows the mapping of
input sequences into output sequences. It is used in three tasks of sequence processing:
Prediction Regression Classification
The model is trained to maximize the conditional distribution of an output sequence {y1,…,yt} given an input sequence {x1,…,xt}.
In our project: Input sequence will be speech labels. Output sequence will be gesture labels.
The SIMILAR NoE Summer Workshop 2005
Synthesis of Gestures Accompanying Speech
Based on the methodology used in correlation analysis given a speech signal: Features will be extracted. Most probable speech label will be designated to speech
patterns. Gesture pattern that is most correlated with speech pattern
will be used to animate a stick model of a person.
The SIMILAR NoE Summer Workshop 2005
Resources Database Preparation and Labeling
VirtualDub Anvil Paraat
Image Processing and Feature Extraction: Matlab Image Processing Toolbox OpenCV Image Processing Library
Gesture-Speech Correlation Analysis HTK HMM Toolbox Torch Machine Learning Library
The SIMILAR NoE Summer Workshop 2005
Work Plan Timeline of the project:
Schedule of the lectures:
The SIMILAR NoE Summer Workshop 2005
Team Members Ferda Ofli
Koc University Image, Video Processing and Feature Extraction
Yelena Yasinnik Massachusetts Institute of Technology Audio-Visual Correlation Analysis
Oya Aran Bogazici University Gesture Based Human-Computer Interaction Systems
The SIMILAR NoE Summer Workshop 2005
Team Members Alexey Anatolievich Karpov
Saint-Petersburg Institute for Informatics and Automation Speech Based Human-Computer Interaction Systems
Stephen Wilson University College Dublin Audio-Visual Gesture Annotation
Alexander Refsum Jensenius Department of Music, Oslo University Gesture Analysis
The SIMILAR NoE Summer Workshop 2005
References Jie Yao and Jeremy R. Cooperstock, “Arm Gesture Detection in a Classroom Environment,”
Proc. WACV’02 pp. 153-157, 2002. Y. Azoz, L. Devi. R. Sharma, “Tracking Hand Dynamics in Unconstrained Environments,” Proc.
Int. Conference on Automatic Face and Gesture Recognition’98 pp. 274-279, 1998. S. Malassiotis, N. Aifanti, M.G. Strintzis, “A Gesture Recognition System Using 3D Data,” Proc.
Int. Symposium on 3D Data Processing Visualization and Transmission’02 pp. 190-193,2002. J-M. Chung, N. Ohnishi, “Cue Circles: Image Feature for Measuring 3-D Motion of Articulated
Objects Using Sequential Image Pair,” Proc. Int. Conference on Automatic Face and Gesture Recognition’98 pp. 474-479, 1998.
S. Kettebekov, M. Yeasin, R. Sharma, “Prosody based co-analysis for continuous recognition of coverbal gestures,”Proc. ICMI’02 pp.161-166, 2002.
F. Quek, D. McNeill, R. Ansari, X-F. Ma, R. Bryll, S. Duncan, K.E. McCullough “Gesture cues for conversational interaction in monocular video,” Proc. Int. Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems’99 pp. 119-126, 1999.
For detailed information visit: http://htk.eng.cam.ac.uk Rabiner, L.; Juang, B., “An introduction to hidden Markov models” ASSP Magazine, IEEE, Vol.3,
Iss.1, pp. 4- 16, Jan 1986 Jae-Moon Chung; Ohnishi, N., “Cue circles: image feature for measuring 3-D motion of
articulated objects using sequential image pair” Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, Vol., Iss., pp. 474-479, 14-16 Apr 1998
A. Just, O. Bernier, S. Marcel., “Recognition of isolated complex mono- and bi-manual 3D hand gestures” Proc. 6. ICAFGR, 2004