Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014
Audio-Visual Speech Processing
Gérard Chollet with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations…
■ A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
■ The combined use of facial and speech information improves identity verification and robustness to forgeries.
■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
■ SmartPhones, VisioPhones, WebPhones, SecurePhones, Visio Conferences, Virtual Reality worlds are gaining popularity.
Some topics under study…
■ Audio-visual speech recognition – automatic 'lip-reading'
■ Audio-visual speaker verification – detection of forgeries
■ Speech-driven animation of the face – could we look and sound like somebody else?
■ Speaker indexing – 'Who is talking in a video sequence?'
■ OUISPER: a silent speech interface – corpus-based synthesis from tongue and lips
Audio Visual Speech Recognition

[Block diagram: feature extraction feeds a decoder, which uses acoustic models, a dictionary and a grammar.]
Video Mike (IBM, 2004)
Audio processing

■ Feature extraction
■ Digit detection
■ Digit recognition:
  • acoustic parameters: MFCC
  • context-independent HMMs
  • decoding: time-synchronous algorithm
■ Sound effects – noise: babble
■ Recognition experiments
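The MFCC front-end mentioned above can be sketched with numpy/scipy. This is a generic textbook pipeline (framing, Hamming window, mel filterbank, log, DCT), not the exact front-end used in these experiments; all frame sizes and filter counts are illustrative defaults.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160,
         n_filters=26, n_fft=512):
    """Minimal MFCC sketch: framing, windowing, power spectrum,
    triangular mel filterbank, log compression, DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # triangular filters centred on mel-spaced frequencies
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = \
            np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = \
            np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)

    log_energies = np.log(np.maximum(power @ fbank.T, 1e-10))
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```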
Video processing

■ Video extraction
■ Lip localisation
■ Image interpolation (to the same frame rate as the speech features)
■ Feature extraction:
  • DCT and DCT2 (DCT + LDA)
  • projections: PRO and PRO2 (PRO + LDA)
■ Recognition experiments
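The DCT visual features above can be sketched as a 2-D DCT of a grayscale lip region, keeping a low-frequency block of coefficients. The 6×6 block size is an illustrative assumption, and the LDA step of DCT2 is omitted here.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(lip_roi, keep=6):
    """2-D DCT of a grayscale lip region; retain a low-frequency
    keep x keep block of coefficients as the visual feature vector."""
    coeffs = dct(dct(lip_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:keep, :keep].flatten()
```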
Fusion techniques

■ Parameter fusion:
  • concatenation
  • dimensionality reduction: Linear Discriminant Analysis (LDA)
  • modelling: classical single-stream HMM
■ Score fusion: multi-stream HMM
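The two fusion levels above can be illustrated in a few lines of numpy. The 0.7 audio weight is an arbitrary illustrative value, not the one used in these experiments, and the truncation to the shorter stream is a simplification of the frame-rate interpolation described earlier.

```python
import numpy as np

def concat_fusion(audio_feats, video_feats):
    """Parameter-level fusion: frame-synchronous concatenation.
    Truncates to the shorter stream (assumes matched frame rates)."""
    n = min(len(audio_feats), len(video_feats))
    return np.hstack([audio_feats[:n], video_feats[:n]])

def stream_score_fusion(audio_loglik, video_loglik, audio_weight=0.7):
    """Score-level fusion as in a multi-stream HMM:
    weighted sum of per-stream log-likelihoods."""
    return audio_weight * audio_loglik + (1 - audio_weight) * video_loglik
```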
Experimental results: parameter fusion

[Figure: word accuracy (%, 0–100) versus S/N (−15 to +10 dB) for speech only, video only (PRO2, DCT2) and AV fusion (PRO2, DCT2).]
Experimental results: score fusion at −5 dB

[Figure: accuracy (roughly 42–52 %) for speech only and AV multi-stream fusion with PRO, PRO2, DCT and DCT2 features.]
Audiovisual identity verification

■ Fusion of face and speech for identity verification
■ Detection of possible forgeries
■ Compulsory? For:
  – homeland/corporate security: restricted access, …
  – secure computer login
  – secure on-line signing of contracts
Talking-face and 2D face sequence database

■ Data: video sequences (.avi) in which a short English phrase is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■ Audio-video data used for talking-face evaluations
■ The same sequences are used for 2D-face-from-video evaluations
■ 430 subjects each pronounced 4 phrases:
  – drawn from a set of 430 English phrases
  – 2 indoor video files acquired during the first session
  – 2 outdoor video files acquired during the second session
  – realistic forgeries created a posteriori
Audio-Visual Speech Features

■ Visual features: raw pixel values, DCT transform, shape-related features, many others …
■ Acoustic features: raw amplitude, "classical" MFCC coefficients, many others …
Audio-Visual Subspaces

■ Reduced audiovisual subspace: Principal Component and Linear Discriminant Analysis applied to the joint audio and visual features
■ Correlated audio and visual subspaces: Co-Inertia and Canonical Correlation Analysis
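Canonical Correlation Analysis, mentioned above, can be sketched with numpy alone: whiten each feature set, then take the SVD of the whitened cross-covariance. This is a generic two-view CCA, not the talk's Co-Inertia implementation, and the regularisation constant is an illustrative assumption.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """First pair of canonical directions and the first canonical
    correlation between two feature sets X, Y (frames x dims)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # whiten each space via Cholesky factors, then SVD the cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, 0], Wy @ Vt[0], s[0]
```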
Correspondence Measures

■ In the audiovisual subspace: Gaussian Mixture Models, Neural Networks, Coupled HMMs
■ In the correlated subspaces: correlation, mutual information
Application to indexing

■ High-level requests:
  – "Find videos where John Doe is speaking"
  – "Find dialogues between Mr X and Mrs Y"
  – "Locate the singer in this music video"

[Diagram: correlation between raw audio energy and raw pixel values.]
Who is speaking?

■ Face tracking
■ Correlation between the pixels of each face and the raw audio energy
■ Find the maximum synchrony (green: current speaker)
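The maximum-synchrony rule above can be sketched by correlating the audio energy with a per-face pixel-variation signal and picking the best-matching face. The function and signal names are mine; a real system would track mouth regions rather than arbitrary pixel traces.

```python
import numpy as np

def find_speaker(audio_energy, face_pixel_tracks):
    """Return the index of the face whose pixel-variation signal is most
    correlated with the audio energy, plus all correlation scores.
    audio_energy: (T,) per-frame energy; face_pixel_tracks: list of (T,)."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = [corr(audio_energy, f) for f in face_pixel_tracks]
    return int(np.argmax(scores)), scores
```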
How to perform "talking-face" authentication?

Face recognition and speaker verification scores are fused: both modalities say OK. But what if the access is a deliberate imposture?
Biometrics

■ Identity verification with talking faces:
  – speaker verification
  – face recognition
■ What if the face says OK and the voice says OK, but the access should still be rejected?
Identity Verification

A model is built for client λ at enrolment (Co-Inertia Analysis). A person ε pretending to be client λ is accepted if the verification score exceeds a threshold, and rejected otherwise. Equal Error Rate: 30 %.
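The Equal Error Rate quoted above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal numpy scorer (hypothetical helper, not the evaluation tool used in the talk):

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep thresholds over all observed scores and return the EER:
    the point where false acceptance equals false rejection."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors accepted
        frr = np.mean(client_scores < t)      # clients rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```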
Replay Attack Detection

A synchrony model is trained with Co-Inertia Analysis and Canonical Correlation Analysis; at test time, an access is accepted if its audio-visual synchrony score exceeds a threshold, and rejected otherwise.
Replay Attacks Detection
In a genuine synchronized video the lips match the audio; in an audio replay attack the lips do not match the audio perfectly.
Equal Error Rate: 14 %
Example of Replay attacks
[Figure: correlation versus audio/video delay (delayed video to delayed audio, −5 to +5 frames); the alignment is found by maximum correlation.]
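The alignment-by-maximum-correlation idea above can be sketched by scanning a range of frame lags and keeping the one that maximises the correlation between the two synchrony signals: a genuine recording should peak near lag 0, a dubbed replay elsewhere. The function name and lag range are illustrative assumptions.

```python
import numpy as np

def best_lag(audio_sync, video_sync, max_lag=5):
    """Return the frame offset in [-max_lag, max_lag] that maximises the
    correlation between audio and video synchrony signals."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    lags = list(range(-max_lag, max_lag + 1))
    scores = []
    for lag in lags:
        if lag >= 0:   # audio shifted forward relative to video
            a, v = audio_sync[lag:], video_sync[:len(video_sync) - lag]
        else:          # audio shifted backward relative to video
            a, v = audio_sync[:lag], video_sync[-lag:]
        scores.append(corr(a, v))
    return lags[int(np.argmax(scores))]
```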
Audiovisual identity verification
■ Available features extracted from the video:
  – face: facial features (lips, eyes) → face modality
  – speech → speech modality
  – speech synchrony → synchrony modality
Audiovisual identity verification
■ Face modality
  – Detection:
    • generative models (MPT toolbox)
    • temporal median filtering
    • eye detection within faces
  – Normalization: geometry + illumination
Audiovisual identity verification

■ Face modality:
  – two verification strategies within a single comparison framework
    • Global = eigenfaces:
      – compute a set of directions (eigenfaces) defining a projection space
      – two faces are compared through their projections onto the eigenface space
      – learning data: BIOMET (130 persons) + BANCA (30 persons)
Audiovisual identity verification
■ Face modality:
  • SIFT descriptors:
    – keypoint extraction
    – keypoint representation: a 128-dimensional descriptor (gradient orientation histograms, …) plus a 4-dimensional position vector (x, y, scale, orientation)
Audiovisual identity verification

■ Face modality:
  • SVD-based matching method, comparing two videos V1 and V2:
    – exclusive principle: one-to-one correspondences between faces (global) and between descriptors (local)
    – principle: compute a proximity matrix between faces or descriptors, then extract good pairings (made easy by SVD computation)
    – scores: one matching score between global representations and one between local representations
Variability!
Audiovisual identity verification
■ Speech modality:
  – GMM-based approach:
    • one world model
    • each speaker model is derived from the world model by MAP adaptation
    • the speech verification score is derived from a likelihood ratio
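The likelihood-ratio scoring above can be illustrated with a heavily simplified stand-in: a single diagonal Gaussian per model instead of a MAP-adapted GMM. All names are mine; this is a sketch of the scoring rule, not the system from the talk.

```python
import numpy as np

def diag_gauss_loglik(X, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian
    (one-component stand-in for a GMM)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var,
                         axis=1)

def verification_score(X, client_model, world_model):
    """Average log-likelihood ratio (client vs. world) over test frames;
    positive scores favour the claimed identity."""
    return float(np.mean(diag_gauss_loglik(X, *client_model)
                         - diag_gauss_loglik(X, *world_model)))
```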
Page 32 ATSIP, Sousse, May 18th, 2014
Audiovisual identity verification
■ Synchrony modality:
  – Principle: the synchrony between lips and speech carries identity information
  – Process:
    • compute a synchrony model (Co-Inertia Analysis) for each person, based on DCT (visual signal) and MFCC (speech signal) features
    • compare the test sample with the synchrony model
Page 33 ATSIP, Sousse, May 18th, 2014
Audiovisual identity verification

■ Experiments:
  – BANCA database:
    • 52 persons divided into two groups (G1 and G2)
    • 3 recording conditions
    • 8 recordings per person (4 client accesses, 4 impostor accesses)
    • evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
  – Scores:
    • 4 scores per access (PCA face, SIFT face, speech, synchrony)
    • score fusion based on an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely
Page 34 ATSIP, Sousse, May 18th, 2014
Audiovisual identity verification
■ Experiments: [results figure not reproduced]
SecurePhone
■ A technical solution that improves security
■ Biometric recognition making use of VOICE, FACE and SIGNATURE
■ An electronic signature is used to secure information exchange
Biometrics in SecurePhone

■ Operation: face, voice and written signature are each pre-processed and modelled; the three modality scores are fused, and access is granted or denied.
The BioSecure Multimodal Evaluation Campaign

■ Launched in April 2007
■ Many modalities, including 'Video sequences' and 'Talking Faces'
■ Development data and reference systems available
■ Evaluations on the sequestered BioSecure database (1000 clients)
■ Debriefing workshop
■ More info: http://www.int-evry.fr/biometrics/BMEC2007/index.php
Audio-visual forgery scenarios

■ Low-effort
  – "Paparazzi" scenario: the impostor owns a picture of the face and a recording of the voice of the target
  – "Big Brother" scenario: the impostor owns a video of the face and a recording of the voice of the target
■ High-effort
  – "Imitator" scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
  – "Playback" scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion
  – "Ventriloquist" scenario: combines the two previous ones
Detection of imposture
Face modality: ACCEPTED!
Voice modality: ACCEPTED!
Synchronisation: DENIED!
Talking-face forgeries @ BMEC: audio replay attack

■ Assumptions:
  – the forger has recorded speech data from the genuine user in outdoor (test) conditions
  – the forger replays the audio and uses his own face in front of the sensor

[Illustrations: stolen wave; audio replay + forger's face; audio replay + "random" face.]
Talking-face forgeries @ BMEC: face animation + TTS (Crazy Talk)

■ Assumptions:
  – the forger has stolen a picture
  – the forger uses face-animation software and TTS (male or female voice)
  – the forger plays back the animation to the sensor

[Illustrations: stolen picture; contour detection; generated .avi.]
Talking-face forgeries @ BMEC: picture presentation + TTS

■ Assumptions:
  – the forger has stolen a picture and printed it
  – the forger presents the printed picture to the sensor and uses TTS (the same wave as for the face-animation forgery)

[Illustrations: stolen picture; presented picture.]
Systems with fusion of (face, speech)

[Diagram: the video sequence is split into frames and a speech signal; face verification yields a face score, speaker verification a speech score, and the two are combined into a fusion score.]
Voice Conversion methods

■ GMM conversion
  – Training of a joint Gaussian model:
    • parallel corpus of aligned sentences from both the source and the target voice
    • MFCC on HNM (Harmonic plus Noise Model) parameterization
  – Speech synthesis from the Gaussian model:
    • inversion of the MFCC
    • pitch correction
■ ALISP conversion
  – A very-low-bit-rate (500 bps) speech compression method, originally developed by TELECOM-ParisTech
  – An indexed dictionary of segments of the target voice
  – HNM parameterization
Voice conversion techniques
Definition: the process of making one person's voice (the "source") sound like another person's voice (the "target").

[Illustration: "My name is John" spoken by the source is converted to sound as if spoken by the target.]
Principle of ALISP
[Coder: input speech undergoes spectral and prosodic analysis; segmental units are selected against a dictionary of representative segments, transmitting a segment index and prosodic parameters. Decoder: the same dictionary drives concatenative HNM synthesis of the output speech.]
Details of Encoding
[Encoder: speech undergoes spectral and prosodic analysis; HMM recognition against a dictionary of HMM models of ALISP classes yields the index of the ALISP class; among the representative units of that class (synthesis units A1 … A8), one is selected by DTW, giving the index of the synthesis unit; prosodic encoding transmits pitch, energy and duration.]
Details of decoding
[Decoder: the ALISP class index, the synthesis-unit index within the class and the prosodic parameters are used to load the synthesis unit; concatenative synthesis produces the output speech.]
Principle of Alisp conversion
Learning step (one hour of target voice):
  – parametric analysis: MFCC
  – segmentation based on temporal decomposition and vector quantization
  – stochastic modelling based on HMMs
  – creation of representative units

Conversion step:
  – parametric analysis: MFCC
  – HMM recognition
  – selection of the representative segment by DTW

Synthesis step:
  – concatenation of the representative units
  – HNM synthesis
Voice conversion using ALISP results
[Audio examples on the BREF and NIST databases: source, target and conversion result, for female and male speakers.]
Demonstration of Voice Conversion

[Audio examples: impostor voice; voice converted with GMM; voice converted with ALISP; voice converted with ALISP + GMM; target voice.]
3D reconstruction

• 3D face modelling from a front and a profile shot
• Animated face
• https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
Face Transformation

■ Control point selection (Figure 1)
■ Image segmentation (Figure 2: division of an image)
■ Linear transformation between the source and the target image
■ Blending step
Face Transformation

■ Localisation of control points in the source and target images
■ Warping: X' = f(X)
■ Blending: p = αp + (1 − α)p'
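The warping and blending steps above can be sketched in numpy. The affine form of f and the function names are illustrative assumptions; the actual transformation may be a more general piecewise warp over the segmented regions.

```python
import numpy as np

def affine_warp(points, A, b):
    """Warping step X' = f(X): a simple affine map applied to
    control points of shape (N, 2)."""
    return points @ A.T + b

def blend(source_pixels, warped_target_pixels, alpha=0.5):
    """Blending step p = alpha*p + (1 - alpha)*p' between the source
    pixels and the warped target pixels."""
    return alpha * source_pixels + (1.0 - alpha) * warped_target_pixels
```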
Face transformation (IBM)
Ouisper1 - Silent Speech Interface
■ A sensor-based system allowing speech communication via the standard articulators, but without glottal activity
■ Two distinct types of application:
  – an alternative to tracheo-oesophageal speech (TES) for persons who have undergone a tracheotomy
  – a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
■ Speech Synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
Ouisper - System Overview
[System overview: an audio-visual speech corpus is built from ultrasound video of the vocal tract, optical video of the speaker's lips, and recorded audio aligned with its text. Training: visual feature extraction and speech alignment. Test: a visual speech recognizer turns the visual data into an N-best list of phonetic or ALISP targets; visual unit selection and audio unit concatenation then produce the output speech.]
Ouisper - Training Data
Ouisper - Video Stream Coding
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "Eigentongue Feature Extraction for an Ultrasound-based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.

■ Build a subset of typical frames
■ Perform PCA
■ Code new frames by their projections onto the set of eigenvectors
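The EigenTongue coding described above amounts to PCA on vectorised ultrasound frames, followed by projection of new frames onto the leading eigenvectors. A numpy sketch (function names and the component count are mine):

```python
import numpy as np

def eigen_frames(frames, n_components=8):
    """PCA on vectorised frames: returns the mean frame and the top
    eigenvectors ('EigenTongues') as rows."""
    X = frames.reshape(len(frames), -1).astype(float)
    mean = X.mean(0)
    # right singular vectors of the centred data are the eigenvectors
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def encode(frame, mean, eigvecs):
    """Code a new frame by its projection coefficients onto the basis."""
    return eigvecs @ (frame.ravel().astype(float) - mean)
```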
Ouisper - Audio Stream Coding
■ ALISP segmentation:
  – detection of quasi-stationary parts in the parametric representation of speech
  – assignment of segments to classes using unsupervised classification techniques
■ Phonetic segmentation:
  – forced alignment of the speech with its text
  – requires a relevant and correct phonetic transcription of the uttered signal
■ Corpus-based synthesis:
  – requires a preliminary segmental description of the signal
Audiovisual dictionary building
■ Visual and acoustic data are synchronously recorded
■ The audio segmentation is used to bootstrap the visual speech recognizer
■ An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/), yielding the audiovisual dictionary
Visuo-acoustic decoding
■ Visual speech recognition:
  – train an HMM model for each visual class, using multistream-based learning techniques
  – perform a "visuo-phonetic" decoding step:
    • use the N-best list
    • introduce linguistic constraints (language model, dictionary, multigrams)
■ Corpus-based speech synthesis:
  – combine probabilistic and data-driven approaches in the audiovisual unit selection step
Speech recognition from video-only data
Ref: "Open your book to the first page"
     ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh
Rec: "A wear your book shoe the verse page"
     ax w ih y uh r b uh k sh uw dh ax v er s p ey jh

Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
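Recognition output such as the Ref/Rec pair above is typically scored by phone error rate: the Levenshtein distance between the phone strings, normalised by the reference length. A minimal scorer (a hypothetical helper, not the evaluation tool used in the study):

```python
import numpy as np

def phone_error_rate(ref, hyp):
    """Levenshtein distance between space-separated phone strings,
    normalised by the number of reference phones."""
    ref, hyp = ref.split(), hyp.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deletions
    d[0, :] = np.arange(len(hyp) + 1)   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(ref)
```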
Ouisper - Conclusion
■ More information on – http://www.neurones.espci.fr/ouisper/
■ Contacts – [email protected] – [email protected] – [email protected]
Audio-Visual Speech Processing Conclusions and Perspectives
■ A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
■ The combined use of facial and speech information improves identity verification and robustness to forgeries.
■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.