Multimodal Information Analysis for Emotion Recognition
(Tele Health Care Application)
Malika Meghjani, Gregory Dudek and Frank P. Ferrie
Our Goal
• Automatic emotion recognition using audio-visual information analysis.
• Create video summaries by automatically labeling the emotions in a video sequence.
Motivation
• Map emotional states of the patient to nursing interventions.
• Evaluate the role of nursing interventions in improving the patient’s health.
NURSING INTERVENTIONS
Proposed Approach
[Diagram: Visual Feature Extraction -> Visual-based Emotion Classification, and Audio Feature Extraction -> Audio-based Emotion Classification; both feed Data Fusion (Decision Level Fusion), which outputs the Recognized Emotional State]
Visual Analysis
1. Face Detection (Viola-Jones Face Detector using Haar
Wavelets and Boosting Algorithm)
2. Feature Extraction (Gabor Filter: 5 Spatial Frequencies
and 4 orientations, 20 filter responses for each frame)
3. Feature Selection (Select the most discriminative features
in all the emotional classes)
4. SVM Classification (Classification with probability
estimates)
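The filter bank in step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the kernel size, the five wavelengths, the `sigma` rule and the aspect ratio `gamma` are all assumed values, and the kernels are built in the spatial domain for simplicity even though the slides mention frequency-domain filters.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a Gabor kernel as a size x size list of lists (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def gabor_bank(size=15, wavelengths=(4, 6, 8, 12, 16), n_orientations=4):
    """Build one kernel per (spatial frequency, orientation) pair."""
    bank = []
    for wavelength in wavelengths:          # 5 spatial frequencies (assumed values)
        for k in range(n_orientations):     # 4 orientations: 0, 45, 90, 135 degrees
            theta = k * math.pi / n_orientations
            bank.append(gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength))
    return bank

bank = gabor_bank()
print(len(bank))  # 5 frequencies x 4 orientations = 20 filters
```

Convolving each detected face region with every kernel in `bank` would give the 20 filter responses per frame described above.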
Visual Feature Extraction
[Figure: input image filtered by (5 Frequencies x 4 Orientations) frequency-domain filters, followed by Feature Selection and Automatic Emotion Classification]
Audio Analysis
1. Audio Pre-Processing (Remove leading and trailing
edges)
2. Feature Extraction (Statistics of Pitch and Intensity
contours and Mel Frequency Cepstral Coefficients)
3. Feature Normalization (Remove inter-speaker
variability)
4. SVM Classification (Classification with probability
estimates)
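The slides do not say how inter-speaker variability is removed in step 3. One common choice, assumed here purely for illustration, is to z-score each feature within each speaker so that differences in baseline pitch or loudness between speakers are cancelled. The function name and data layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_per_speaker(features, speakers):
    """Z-score each feature dimension separately for every speaker.

    features: list of equal-length feature vectors (lists of floats)
    speakers: parallel list of speaker ids
    """
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speakers):
        by_speaker[spk].append(idx)

    out = [None] * len(features)
    for idxs in by_speaker.values():
        n_dims = len(features[idxs[0]])
        mus = [mean([features[i][d] for i in idxs]) for d in range(n_dims)]
        # Fall back to 1.0 when a feature is constant for this speaker.
        sds = [pstdev([features[i][d] for i in idxs]) or 1.0 for d in range(n_dims)]
        for i in idxs:
            out[i] = [(features[i][d] - mus[d]) / sds[d] for d in range(n_dims)]
    return out

# Two speakers with very different mean pitch (Hz) end up on the same scale.
pitch = [[100.0], [200.0], [10.0], [30.0]]
normalized = normalize_per_speaker(pitch, ["a", "a", "b", "b"])
```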
Audio Feature Extraction
• Speech Rate
• Pitch
• Intensity
• Spectrum Analysis
• Mel Frequency Cepstral Coefficients (Short-term Power Spectrum of Sound)
[Figure: Audio Signal -> Audio Feature Extraction -> Automatic Emotion Classification]
Feature Selection
[Figure: bounding planes; plot labels: Average Error Distance, Count of Features Selected]
• The feature selection method is similar to SVM classification (wrapper method).
• It generates a separating plane by minimizing the weighted sum of distances of misclassified data points to two parallel bounding planes.
• It suppresses as many components of the normal to the separating plane as possible while still giving consistent classification results.
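One way to write down an objective consistent with these bullets is shown below. It is a sketch under assumptions, not necessarily the exact formulation used in the slides: the symbols are introduced here for illustration, with the rows of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{k \times n}$ holding the feature vectors of the two classes, $w$ the normal to the separating plane, $\gamma$ its offset, $e$ a vector of ones, and $\lambda \in [0, 1)$ the weight that trades classification error against the number of features kept.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,y,\,z}\quad
  & (1-\lambda)\left(\frac{e^{\top} y}{m} + \frac{e^{\top} z}{k}\right)
    + \lambda\,\lVert w \rVert_{0} \\
\text{subject to}\quad
  & -A w + e\gamma + e \le y, \qquad y \ge 0, \\
  & \phantom{-}B w - e\gamma + e \le z, \qquad z \ge 0 .
\end{aligned}
```

Here $y$ and $z$ measure how far misclassified points fall on the wrong side of the bounding planes $x^{\top} w = \gamma + 1$ and $x^{\top} w = \gamma - 1$, and $\lVert w \rVert_{0}$ counts the nonzero components of the normal, i.e. the features selected. In practice the count term is usually replaced by a smooth or $\ell_1$ surrogate to make the problem tractable.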
Data Fusion
1. Decision Level:
• Obtain a probability estimate for each emotional class using SVM margins.
• The probability estimates from the two modalities are multiplied and re-normalized to give the final decision-level emotion classification.
2. Feature Level:
• Concatenate the audio and visual features and repeat the feature selection and SVM classification process.
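Both fusion schemes can be sketched in a few lines. The function names and the example probabilities are made up for illustration; the per-class probabilities are assumed to come from the two SVM classifiers, listed in the same class order.

```python
def fuse_decisions(p_audio, p_visual):
    """Decision-level fusion: multiply per-class probabilities, then re-normalize."""
    product = [a * v for a, v in zip(p_audio, p_visual)]
    total = sum(product)
    if total == 0:
        raise ValueError("the two modalities assign zero probability to every class")
    return [p / total for p in product]

def fuse_features(audio_vec, visual_vec):
    """Feature-level fusion: concatenate the two feature vectors."""
    return list(audio_vec) + list(visual_vec)

# Hypothetical three-class example: audio favors class 0, vision favors class 1.
p_audio = [0.5, 0.3, 0.2]
p_visual = [0.2, 0.6, 0.2]
fused = fuse_decisions(p_audio, p_visual)
predicted = max(range(len(fused)), key=fused.__getitem__)

joint = fuse_features([0.1, 0.2], [0.7, 0.8, 0.9])
```

After feature-level fusion, the concatenated vector goes through the same feature selection and SVM classification steps as a single-modality vector.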
Database and Training
Database:
1. Visual-only posed database
2. Audio-visual posed database
Training:
• Audio segmentation based on the minimum window required for feature extraction.
• Corresponding visual key frame extraction in the segmented window.
• Image-based training and audio-segment-based training.
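The pairing of audio segments with visual key frames can be sketched as below. The window length, the use of non-overlapping windows, and the choice of the frame at the window midpoint as the key frame are all assumptions made for illustration; the slides do not specify them.

```python
def segment_with_key_frames(n_samples, sample_rate, fps, window_s=1.0):
    """Split audio into fixed windows and pick one video frame per window.

    Returns a list of (start_sample, end_sample, key_frame_index) tuples.
    """
    win = int(window_s * sample_rate)
    segments = []
    # Non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, n_samples - win + 1, win):
        end = start + win
        mid_time = (start + end) / 2 / sample_rate   # window midpoint in seconds
        key_frame = int(mid_time * fps)              # video frame at that time
        segments.append((start, end, key_frame))
    return segments

# 3 seconds of 16 kHz audio alongside 30 fps video.
segments = segment_with_key_frames(n_samples=48000, sample_rate=16000, fps=30)
```

Each tuple then yields one audio training example (the samples in the window) and one image training example (the key frame).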
Experimental Results
Statistics
Database | No. of Training Examples | No. of Subjects | No. of Emotional States | % Recognition Rate | Validation Method
Posed Visual Data Only (CKDB) | 120 | 20 | 5 + Neutral | 75% | Leave one subject out cross validation
Posed Audio Visual Data (EDB) | 270 | 9 | 6 | 82% (Decision Level), 76% (Feature Level) |
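Leave-one-subject-out cross validation holds out all examples of one subject per fold, so the classifier is always tested on a person it has never seen. A minimal sketch of the fold generation, with a hypothetical function name and toy subject ids:

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Toy example: five training examples from three subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3"]
folds = list(leave_one_subject_out(subject_ids))
```

The reported recognition rate would then be the accuracy aggregated over all folds.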
Time Series Plot
[Figure: time series of recognized emotions (Surprise, Sad, Angry, Disgust, Happy) on the Cohn-Kanade database (posed, visual only); 75% leave-one-subject-out cross validation result]
Conclusion
• Combining the two modalities (audio and visual) improves overall recognition rates by 11% with Decision Level Fusion and by 6% with Feature Level Fusion.
• Emotions where vision wins: Disgust, Happy and Surprise.
• Emotions where audio wins: Anger and Sadness.
• Fear was equally well recognized by the two modalities.
• Automated multimodal emotion recognition is clearly effective.