Audio-Visual Speech Processing
Gérard Chollet, with Meriem Bendris, Hervé Bredin, Thomas Hueber, Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, and Leila Zouari
ATSIP, Sousse, March 18th, 2014


DESCRIPTION

Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014


Page 1

Audio-Visual Speech Processing

Gérard Chollet, with Meriem Bendris, Hervé Bredin, Thomas Hueber, Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, and Leila Zouari

ATSIP, Sousse, March 18th, 2014

Page 2

Some motivations…

■  A talking face is more intelligible, expressive, recognisable, and attractive than acoustic speech alone.

■  The combined use of facial and speech information improves identity verification and robustness to forgeries.

■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.

■  SmartPhones, VisioPhones, WebPhones, SecurePhones, videoconferencing, and virtual-reality worlds are gaining popularity.

Page 3

Some topics under study…

■  Audio-visual speech recognition
   –  Automatic 'lip-reading'

■  Audio-visual speaker verification
   –  Detection of forgeries

■  Speech-driven animation of the face
   –  Could we look and sound like somebody else?

■  Speaker indexing
   –  'Who is talking in a video sequence?'

■  OUISPER: a silent speech interface
   –  Corpus-based synthesis from tongue and lips

Page 4

Audio-Visual Speech Recognition

[Block diagram: feature extraction feeds a decoder, which draws on acoustic models, a dictionary, and a grammar.]

Page 5

Video Mike (IBM, 2004)


Page 6

Audio processing

■  Feature extraction
■  Digit detection
■  Digit recognition:
   •  Acoustic parameters: MFCC (see the sketch below)
   •  Context-independent HMMs
   •  Decoding: time-synchronous algorithm
■  Sound effects
   –  Noise: babble
■  Recognition experiments
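As an illustration of this front-end, here is a minimal MFCC extraction sketch using the librosa library (the slides do not name a toolkit; the file name, sampling rate, and frame parameters below are assumptions):

```python
import librosa
import numpy as np

# Load an utterance (file name and 16 kHz rate are assumptions for illustration).
signal, sr = librosa.load("digits_utterance.wav", sr=16000)

# 13 MFCCs per 25 ms frame with a 10 ms hop, a common ASR front-end.
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

# Append first and second temporal derivatives (delta and delta-delta).
features = np.vstack([mfcc, librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
print(features.shape)  # (39, n_frames)
```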

Page 7

Video processing

■  Video extraction
■  Lip localisation
■  Image interpolation (to the same frame rate as the speech features)
■  Feature extraction
   •  DCT and DCT2 (DCT + LDA); see the sketch below
   •  Projections: PRO and PRO2 (PRO + LDA)
■  Recognition experiments
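A hedged sketch of the DCT2 visual front-end named above: a 2-D DCT of the lip region followed by LDA. The region size, number of retained coefficients, and the synthetic data and labels are illustrative assumptions:

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dct_features(lip_roi, n_coeffs=30):
    """2-D DCT of a grey-level lip region; keep low-order coefficients."""
    c = dct(dct(lip_roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return c.flatten()[:n_coeffs]  # a zig-zag selection would be more standard

# Illustrative data: one 32x32 lip crop per frame, with digit labels per frame.
rois = np.random.rand(500, 32, 32)      # stand-in for tracked lip regions
labels = np.random.randint(0, 10, 500)  # stand-in for per-frame digit classes

X = np.array([dct_features(r) for r in rois])
lda = LinearDiscriminantAnalysis(n_components=9)  # at most n_classes - 1
X_dct2 = lda.fit_transform(X, labels)             # the "DCT2" features
```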

Page 8

Fusion techniques

■  Parameter fusion:
   •  Concatenation
   •  Dimensionality reduction: Linear Discriminant Analysis (LDA)
   •  Modelling: classical single-stream HMM

■  Score fusion: multi-stream HMM (see the sketch below)
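A minimal sketch of the score-fusion idea behind a multi-stream HMM: per-stream log-likelihoods combined with exponent weights. The weights and log-likelihood values are placeholders; a real system would apply the stream weights inside HMM decoding:

```python
import numpy as np

def fuse_stream_scores(loglik_audio, loglik_video, audio_weight=0.7):
    """Weighted log-likelihood combination, the score-level analogue of a
    two-stream HMM with stream exponents summing to one."""
    return audio_weight * loglik_audio + (1.0 - audio_weight) * loglik_video

# Placeholder per-word-model log-likelihoods for one test utterance.
ll_audio = np.array([-120.3, -115.9, -118.2])  # e.g. models for "one", "two", "three"
ll_video = np.array([-80.1, -82.5, -79.4])

fused = fuse_stream_scores(ll_audio, ll_video)
print("recognised word index:", int(np.argmax(fused)))
```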

Page 9

Experimental results: parameter fusion

[Figure: word accuracy (%) versus SNR (dB, from -15 to +10) for five systems: speech only, video only (PRO2), video only (DCT2), AV fusion (PRO2), and AV fusion (DCT2).]

Page 10

Experimental results: score fusion at -5 dB SNR

[Bar chart: accuracy (%), on a 42-52 % scale, for speech only and for audio-visual fusion with PRO, PRO2, DCT, and DCT2 visual features.]

Page 11

Audiovisual identity verification

■  Fusion of face and speech for identity verification
■  Detection of possible forgeries
■  Compulsory? For:
   –  homeland and corporate security: restricted access, …
   –  secured computer login
   –  secured on-line signature of contracts

Page 12

Talking-face and 2D face sequence database

■  Data: video sequences (.avi) in which a short English phrase is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■  Audio-video data used for talking-face evaluations
■  The same sequences are used for 2D-face-from-video evaluations
■  430 subjects pronounced 4 phrases each:
   –  drawn from a set of 430 English phrases
   –  2 indoor video files acquired during the first session
   –  2 outdoor video files acquired during the second session
   –  realistic forgeries created a posteriori

Page 13

Audio-Visual Speech Features

■  Visual features: raw pixel values, DCT transform, shape-related features, and many others
■  Audio features: raw amplitude, 'classical' MFCC coefficients, and many others

Page 14

Audio-Visual Subspaces

■  Reduced audiovisual subspace: the audio and visual feature spaces are combined, then reduced by Principal Component and Linear Discriminant Analysis
■  Correlated audio and visual subspaces: obtained by Co-inertia Analysis and Canonical Correlation Analysis (see the sketch below)
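A minimal sketch of the correlated-subspace idea using Canonical Correlation Analysis from scikit-learn. The feature dimensions and synthetic data are assumptions; in the talk the two streams would be, e.g., MFCC and lip-DCT frames:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_frames = 1000
audio = rng.normal(size=(n_frames, 39))   # stand-in for MFCC + deltas
visual = rng.normal(size=(n_frames, 30))  # stand-in for lip DCT coefficients
visual[:, 0] += 0.8 * audio[:, 0]         # inject some audio-visual coupling

# Project both streams onto maximally correlated directions.
cca = CCA(n_components=5)
audio_c, visual_c = cca.fit_transform(audio, visual)

# Per-component canonical correlations, usable as a synchrony measure.
corrs = [np.corrcoef(audio_c[:, k], visual_c[:, k])[0, 1] for k in range(5)]
print(np.round(corrs, 2))
```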

Page 15

Correspondence Measures

■  In the audiovisual subspace: Gaussian Mixture Models, Neural Networks, Coupled HMMs
■  In the correlated subspaces: correlation, mutual information (a mutual-information sketch follows below)
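As a sketch of the mutual-information measure, one simple estimator quantises the two projected streams and computes MI from the joint histogram. The bin count and synthetic data are illustrative assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based MI estimate (in nats) between two 1-D signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
a = rng.normal(size=5000)            # e.g. projected audio component
v = 0.6 * a + rng.normal(size=5000)  # e.g. projected visual component
print(round(mutual_information(a, v), 3))
```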

Page 16

Application to indexing

■  High-level requests:
   –  "Find videos where John Doe is speaking"
   –  "Find dialogues between Mr X and Mrs Y"
   –  "Locate the singer in this music video"

[Diagram: raw audio energy is correlated with raw pixel values to locate the active speaker.]

Page 17

Who is speaking?

■  Face tracking
■  Correlation between:
   –  the pixels of each tracked face
   –  the raw audio energy
■  The speaking face is the one with maximum audio-visual synchrony (see the sketch below)

Green: current speaker
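A hedged sketch of this correlation test. The face tracks and the audio-energy envelope are synthetic placeholders; real input would come from a face tracker and the audio front-end:

```python
import numpy as np

def speaking_face(face_tracks, audio_energy):
    """Return the index of the face track whose per-frame mean pixel value
    (a crude stand-in for the face-pixel features in the slide) correlates
    best with the audio energy envelope.

    face_tracks: list of arrays of shape (n_frames, h, w), grey-level crops
    audio_energy: array of shape (n_frames,)
    """
    scores = [np.corrcoef(track.mean(axis=(1, 2)), audio_energy)[0, 1]
              for track in face_tracks]
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
energy = rng.random(100)
faces = [rng.random((100, 24, 24)) for _ in range(3)]
faces[1] += 0.5 * energy[:, None, None]  # face 1 brightens with the audio
print(speaking_face(faces, energy))      # prints 1
```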

Page 18

How to Perform "Talking-Face" Authentication?

[Diagram: face recognition says OK, speaker verification says OK, and score fusion says OK. But what if the access is a deliberate imposture?]

Page 19

Biometrics

■  Identity verification with talking faces:
   –  speaker verification
   –  face recognition
■  What if?

[Diagram: face OK, voice OK, and yet the decision is NO.]

Page 20

Identity Verification

[Diagram: enrolment of client λ produces a model for client λ via Co-Inertia Analysis; a person ε pretending to be client λ is accepted if the score against that model exceeds a threshold, and rejected otherwise.]

Equal Error Rate: 30 %

Page 21

Replay Attack Detection

[Diagram. Training: a synchrony model ("Sync Model") is learned with Co-IA or CCA. Test: a sample is accepted if it matches the synchrony model well enough, and rejected otherwise.]

Page 22

Replay Attack Detection

■  Genuine video: audio and lips are synchronised
■  Audio replay attack: the lips do not match the audio perfectly
■  Equal Error Rate: 14 %

Page 23

Example of replay attacks

Page 24

Alignment by maximum correlation (a sketch follows below)

[Figure: synchrony score as a function of the audio/video delay (from -5 to +5 frames); the peak indicates the alignment between delayed video and delayed audio.]
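A minimal sketch of such alignment, scanning a range of integer frame lags and keeping the one with maximum correlation. The signals and lag range are illustrative placeholders:

```python
import numpy as np

def best_av_lag(audio_feat, video_feat, max_lag=5):
    """Return the lag (in frames) and correlation maximising agreement
    between two 1-D per-frame feature tracks."""
    best = (None, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio_feat[lag:], video_feat[:len(video_feat) - lag]
        else:
            a, v = audio_feat[:lag], video_feat[-lag:]
        r = np.corrcoef(a, v)[0, 1]
        if r > best[1]:
            best = (lag, r)
    return best

rng = np.random.default_rng(3)
video = rng.normal(size=200)
audio = np.roll(video, 3) + 0.1 * rng.normal(size=200)  # audio lags by 3 frames
print(best_av_lag(audio, video))  # expected lag: 3
```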

Page 25

Audiovisual identity verification

■  Available features:
   –  face features (lips, eyes) → face modality
   –  speech → speech modality
   –  audio-visual synchrony → synchrony modality

Page 26

Audiovisual identity verification

■  Face modality
   –  Detection:
      •  generative models (MPT toolbox)
      •  temporal median filtering
      •  eye detection within faces
   –  Normalisation: geometry + illumination

Page 27

Audiovisual identity verification

■  Face modality:
   –  Two verification strategies within a single comparison framework
      •  Global = eigenfaces (see the sketch below):
         –  computation of a set of directions (the eigenfaces) defining a projection space
         –  two faces are compared through their projections onto the eigenface space
         –  learning data: BIOMET (130 persons) + BANCA (30 persons)
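A hedged sketch of the eigenface comparison. The image size, number of components, and random data are placeholders; the slide's training data would be BIOMET and BANCA face crops:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
train_faces = rng.random((160, 64 * 64))  # stand-in for aligned grey-level crops

# The principal directions of the training faces are the "eigenfaces".
pca = PCA(n_components=50).fit(train_faces)

def face_distance(img_a, img_b):
    """Compare two faces by Euclidean distance in eigenface space."""
    pa, pb = pca.transform([img_a, img_b])
    return np.linalg.norm(pa - pb)

probe, gallery = rng.random(64 * 64), rng.random(64 * 64)
print(round(face_distance(probe, gallery), 2))
```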

Page 28

Audiovisual identity verification

■  Face modality:
   •  Local = SIFT descriptors:
      –  keypoint extraction
      –  keypoint representation: a 128-dimensional vector (gradient-orientation histograms, …) plus a 4-dimensional vector for position (x, y), scale, and orientation

Page 29

Audiovisual identity verification

■  Face modality:
   •  SVD-based matching method (compares two videos V1 and V2):
      –  Exclusion principle: one-to-one correspondences between
         »  faces (global representation)
         »  descriptors (local representation)
      –  Principle (see the sketch below):
         »  compute a proximity matrix between faces or descriptors
         »  extract good pairings (made easy by an SVD of that matrix)
      –  Scores:
         »  one matching score between the global representations
         »  one matching score between the local representations
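A minimal sketch of this SVD pairing step, in the spirit of Scott and Longuet-Higgins' algorithm. The Gaussian proximity kernel, its width, and the toy descriptors are assumptions:

```python
import numpy as np

def svd_pairings(desc_a, desc_b, sigma=1.0):
    """One-to-one pairings between two descriptor sets (exclusion principle).

    Builds a Gaussian proximity matrix, takes its SVD, resets all singular
    values to one, and keeps entries dominating both their row and column.
    """
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2.0 * sigma ** 2))
    U, _, Vt = np.linalg.svd(G)
    k = min(G.shape)
    P = U[:, :k] @ Vt[:k, :]  # "orientation" of G, singular values set to 1
    return [(i, j) for i in range(P.shape[0]) for j in range(P.shape[1])
            if P[i, j] >= P[i].max() and P[i, j] >= P[:, j].max()]

rng = np.random.default_rng(5)
a = rng.normal(size=(6, 128))                        # descriptors from V1
b = a[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 128))  # noisy subset from V2
print(svd_pairings(a, b))  # expected pairs: (0, 1), (1, 2), (2, 0)
```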

Page 30

Variability!

Page 31

Audiovisual identity verification

■  Speech modality:
   –  GMM-based approach (see the sketch below):
      •  one world model
      •  each speaker model is derived from the world model by MAP adaptation
      •  the speech verification score is derived from a likelihood ratio
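A hedged sketch of this GMM world-model approach with scikit-learn, using mean-only MAP adaptation and a fixed relevance factor; the data, feature dimension, and mixture size are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
world_data = rng.normal(size=(5000, 13))        # stand-in for pooled MFCC frames
client_data = rng.normal(0.3, 1.0, (400, 13))   # stand-in for enrolment frames

# World model.
ubm = GaussianMixture(n_components=32, covariance_type="diag",
                      random_state=0).fit(world_data)

# Mean-only MAP adaptation of the world model towards the client data.
client = GaussianMixture(n_components=32, covariance_type="diag")
client.weights_, client.covariances_ = ubm.weights_, ubm.covariances_
client.precisions_cholesky_ = ubm.precisions_cholesky_
post = ubm.predict_proba(client_data)   # frame-to-component posteriors
n_k = post.sum(axis=0)                  # soft counts per component
f_k = post.T @ client_data              # first-order statistics
r = 16.0                                # relevance factor (an assumption)
alpha = (n_k / (n_k + r))[:, None]
client.means_ = (alpha * f_k / np.maximum(n_k, 1e-8)[:, None]
                 + (1 - alpha) * ubm.means_)

# Verification score: average log-likelihood ratio over the test frames.
test = rng.normal(0.3, 1.0, (200, 13))
score = client.score(test) - ubm.score(test)  # score() is mean log-likelihood
print(round(score, 3))
```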

Page 32

Audiovisual identity verification

■  Synchrony modality:
   –  Principle: the synchrony between lips and speech carries identity information
   –  Process:
      •  a synchrony model (Co-Inertia Analysis) is computed for each person from DCT (visual signal) and MFCC (speech signal) features
      •  the test sample is compared against the claimed person's synchrony model

Page 33

Audiovisual identity verification

■  Experiments:
   –  BANCA database:
      •  52 persons divided into two groups (G1 and G2)
      •  3 recording conditions
      •  8 recordings per person (4 client accesses, 4 impostor accesses)
      •  evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
   –  Scores:
      •  4 scores per access (PCA face, SIFT face, speech, synchrony)
      •  score fusion with an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely

Page 34

Audiovisual identity verification

■  Experiments: [results figure]

Page 35

SecurePhone

■  A technical solution that improves security
■  Biometric recognition making use of VOICE, FACE, and SIGNATURE
■  An electronic signature is used to secure information exchange

Page 36

Biometrics in SecurePhone

■  Operation

[Diagram: face, voice, and written signature are each pre-processed and modelled; the three results are fused, leading to 'Access Granted' or 'Access Denied'.]

Page 37

The BioSecure Multimodal Evaluation Campaign

■  Launched in April 2007
■  Many modalities, including 'video sequences' and 'talking faces'
■  Development data and reference systems available
■  Evaluations on the sequestered BioSecure database (1000 clients)
■  Debriefing workshop
■  More info at: http://www.int-evry.fr/biometrics/BMEC2007/index.php

Page 38

Audio-visual forgery scenarios

■  Low-effort
   –  "Paparazzi" scenario: the impostor owns a picture of the face and a recording of the voice of the target
   –  "Big Brother" scenario: the impostor owns a video of the face and a recording of the voice of the target
■  High-effort
   –  "Imitator" scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
   –  "Playback" scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion
   –  "Ventriloquist" scenario: combines the two previous ones

Page 39

Detection of imposture

■  Face modality: ACCEPTED!
■  Voice modality: ACCEPTED!
■  Synchronisation: DENIED!

Page 40

Talking-Face forgeries @ BMEC: audio replay attack

■  Assumptions:
   –  the forger has recorded speech data from the genuine user in outdoor (test) conditions
   –  the forger replays the audio and uses his own face in front of the sensor

[Illustration: stolen wave; audio replay + forger's face; audio replay + "random" face.]

Page 41

Talking-Face forgeries @ BMEC: face animation + TTS (CrazyTalk)

■  Assumptions:
   –  the forger has stolen a picture
   –  the forger uses face-animation software and TTS (male or female voice)
   –  the forger plays back the animation to the sensor

[Illustration: stolen picture, contour detection, generated .avi.]

Page 42

Talking-Face forgeries @ BMEC: picture presentation + TTS

■  Assumptions:
   –  the forger has stolen a picture
   –  the forger has printed the picture
   –  the forger presents the picture to the sensor and uses TTS (the same wave as for the face-animation forgery)

[Illustration: stolen picture; presented picture.]

Page 43

Systems with fusion of (face, speech)

[Diagram: the video sequence is split into frames, which go to face verification (face score), and a speech signal, which goes to speaker verification (speech score); the two scores are combined into a fusion score.]

Page 44

Voice Conversion methods

■  GMM conversion (see the sketch after this list)
   –  Training of a joint Gaussian model:
      •  a parallel corpus of aligned sentences from both source and target voices
      •  MFCC on HNM (Harmonic plus Noise Model) parameterization
   –  Speech synthesis from the Gaussian model:
      •  inversion of the MFCC
      •  pitch correction
■  ALISP conversion
   –  A very-low-bit-rate (500 bps) speech compression method, originally developed by TELECOM-ParisTech
   –  A dictionary of indexed segments of the target voice
   –  HNM parameterization
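A hedged sketch of the joint-GMM mapping at the core of such conversion: fit a GMM on stacked source/target features and convert each frame with the conditional expectation E[y | x]. The aligned features here are synthetic placeholders; a real system would use DTW-aligned HNM/MFCC frames and resynthesise speech afterwards:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
d = 13                                    # feature dimension (e.g. MFCC)
src = rng.normal(size=(3000, d))          # aligned source-speaker frames
tgt = 0.8 * src + 0.5 + 0.1 * rng.normal(size=(3000, d))  # aligned target frames

# Joint model on stacked [source; target] vectors.
gmm = GaussianMixture(n_components=8, covariance_type="full",
                      random_state=0).fit(np.hstack([src, tgt]))

mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
S = gmm.covariances_
S_xx, S_yx = S[:, :d, :d], S[:, d:, :d]

def convert(x):
    """Map one source frame x to E[y | x] under the joint GMM."""
    # Component responsibilities from the marginal model on x.
    lik = np.array([gmm.weights_[m] *
                    multivariate_normal.pdf(x, mu_x[m], S_xx[m])
                    for m in range(len(gmm.weights_))])
    post = lik / lik.sum()
    # Per-component conditional means, blended by the responsibilities.
    cond = np.array([mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
                     for m in range(len(gmm.weights_))])
    return post @ cond

print(np.round(convert(src[0]) - tgt[0], 2))  # small residual expected
```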

Page 45

Voice conversion techniques

■  Definition: the process of making one person's voice (the "source") sound like another person's voice (the "target")

[Diagram: the source says "My name is John"; after voice conversion, the same sentence sounds as if spoken by the target.]

Page 46

Principle of ALISP

[Coder diagram: the input speech undergoes spectral and prosodic analysis; segmental units are selected against a dictionary of representative segments; only a segment index and prosodic parameters are transmitted; the decoder looks up the same dictionary and reconstructs the output speech by concatenative HNM synthesis.]

Page 47

Details of encoding

[Diagram: the speech is analysed spectrally and prosodically; HMM recognition against a dictionary of HMM models of ALISP classes yields the index of an ALISP class; within that class (e.g. HMM A with synthesis units A1 … A8), the representative unit is selected by DTW; prosodic encoding yields pitch, energy, and duration. Transmitted: the ALISP class index, the synthesis-unit index, and the prosodic parameters.]

Page 48

Details of decoding

[Diagram: from the ALISP class index, the synthesis-unit index within the class, and the prosodic parameters, the decoder loads the synthesis unit (e.g. A1 … A8) and produces the output speech by concatenative synthesis.]

Page 49

Principle of ALISP conversion

■  Learning step (one hour of target voice):
   –  parametric analysis: MFCC
   –  segmentation based on temporal decomposition and vector quantization
   –  stochastic modelling based on HMMs
   –  creation of representative units
■  Conversion step:
   –  parametric analysis: MFCC
   –  HMM recognition
   –  selection of the representative segment by DTW
■  Synthesis step:
   –  concatenation of the representative units
   –  HNM synthesis

Page 50

Voice conversion using ALISP: results

[Audio examples on the BREF and NIST databases: source, converted result, and target samples (female and male voices).]

Page 51

Demonstration of Voice Conversion

[Audio examples: impostor voice; voice converted with GMM; voice converted with ALISP; voice converted with ALISP + GMM; target voice.]

Page 52

3D reconstruction

•  3D face modelling from a front and a profile shot
•  Animated face
•  Demos: https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos

Page 53

Face Transformation

■  Control point selection (Figure 1)
■  Image segmentation (Figure 2: division of an image)
■  Linear transformation between source and target image
■  Blending step

[Figures: control-point selection; division of an image into regions; source and target images.]

Page 54

Face Transformation

■  Localisation of control points
■  Warping: X' = f(X)
■  Blending: p = αp + (1 − α)p'

[Illustration: source and target ("cible") faces, with corresponding points X → X' and blended pixel values p, p'. A sketch of these two steps follows below.]
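A minimal sketch of the warping-and-blending steps with OpenCV, using a single affine warp estimated from three control-point pairs; real morphing would use many points and piecewise warps, and the file names and points here are placeholders:

```python
import cv2
import numpy as np

# Three corresponding control points in source and target (placeholders).
src_pts = np.float32([[30, 40], [90, 38], [60, 100]])  # e.g. eyes and mouth
tgt_pts = np.float32([[33, 45], [95, 44], [62, 108]])

source = cv2.imread("source_face.png")
target = cv2.imread("target_face.png")

# Warping: X' = f(X), here an affine map fitted to the control points.
f = cv2.getAffineTransform(src_pts, tgt_pts)
warped = cv2.warpAffine(source, f, (target.shape[1], target.shape[0]))

# Blending: p = alpha * p + (1 - alpha) * p'.
alpha = 0.5
blend = cv2.addWeighted(warped, alpha, target, 1.0 - alpha, 0.0)
cv2.imwrite("blended_face.png", blend)
```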

Page 55

Face transformation (IBM)

Page 56

Ouisper(1) - Silent Speech Interface

■  A sensor-based system allowing speech communication via the standard articulators, but without glottal activity
■  Two distinct types of application:
   –  an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
   –  a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
■  Speech synthesis from ultrasound and optical imagery of the tongue and lips

1) Oral Ultrasound synthetIc SPEech souRce

Page 57

Ouisper - System Overview

[Diagram. Training: ultrasound video of the vocal tract, optical video of the speaker's lips, and recorded audio are collected; visual feature extraction and speech alignment against the text build an audio-visual speech corpus. Test: visual data go through a visual speech recognizer, which outputs an N-best list of phonetic or ALISP targets; visual unit selection followed by audio unit concatenation produces the synthesized speech.]

Page 58

Ouisper - Training Data

Page 59

Ouisper - Video Stream Coding

T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "EigenTongue Feature Extraction For An Ultrasound-based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.

■  Build a subset of typical frames
■  Perform PCA (the resulting eigenvectors are the "EigenTongues")
■  Code new frames by their projections onto the set of eigenvectors

Page 60

Ouisper - Audio Stream Coding

■  ALISP segmentation:
   –  detection of quasi-stationary parts in the parametric representation of speech
   –  assignment of segments to classes using unsupervised classification techniques
■  Phonetic segmentation:
   –  forced alignment of the speech with the text
   –  needs a relevant and correct phonetic transcription of the uttered signal
■  Corpus-based synthesis:
   –  needs a preliminary segmental description of the signal

Page 61

Audiovisual dictionary building

■  Visual and acoustic data are synchronously recorded
■  The audio segmentation is used to bootstrap the visual speech recognizer
■  An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/), yielding the audiovisual dictionary (see the sketch below)
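A hedged sketch of per-class HMM training with the hmmlearn library. The feature dimension, state count, and synthetic observations are placeholders; real observations would be the visual features of all occurrences of one phonetic class:

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(8)

def train_class_hmm(sequences, n_states=3):
    """Train one Gaussian HMM on the feature sequences of a single
    phonetic class."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, random_state=0)
    model.fit(X, lengths)
    return model

# One synthetic 20-frame visual-feature sequence per occurrence of /e-r/.
occurrences = [rng.normal(size=(20, 30)) for _ in range(15)]
dictionary = {"/e-r/": train_class_hmm(occurrences)}

# Decoding later scores a test sequence against every class model.
test = rng.normal(size=(20, 30))
print(round(dictionary["/e-r/"].score(test), 1))
```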

Page 62

Visuo-acoustic decoding

■  Visual speech recognition:
   –  train an HMM model for each visual class
      •  using multistream-based learning techniques
   –  perform a "visuo-phonetic" decoding step
      •  use the N-best list
      •  introduce linguistic constraints: language model, dictionary, multigrams
■  Corpus-based speech synthesis:
   –  combine probabilistic and data-driven approaches in the audiovisual unit-selection step

Page 63

Speech recognition from video-only data

Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh
     ("Open your book to the first page")
Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh
     ("A wear your book shoe the verse page")

Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.

Page 64

Ouisper - Conclusion

■  More information at: http://www.neurones.espci.fr/ouisper/
■  Contacts:
   –  [email protected]
   –  [email protected]
   –  [email protected]

Page 65

Audio-Visual Speech Processing: Conclusions and Perspectives

■  A talking face is more intelligible, expressive, recognisable, and attractive than acoustic speech alone.
■  The combined use of facial and speech information improves identity verification and robustness to forgeries.
■  Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
