Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014
Audio-Visual Speech Processing
Gérard Chollet with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations…
■ A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
■ The combined use of facial and speech information improves identity verification and robustness to forgeries.
■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
■ SmartPhones, VisioPhones, WebPhones, SecurePhones, Visio Conferences, Virtual Reality worlds are gaining popularity.
Some topics under study…
■ Audio-visual speech recognition – automatic 'lip-reading'
■ Audio-visual speaker verification – detection of forgeries
■ Speech-driven animation of the face – could we look and sound like somebody else?
■ Speaker indexing – 'Who is talking in a video sequence?'
■ OUISPER: a silent speech interface – corpus-based synthesis from tongue and lips
Audio Visual Speech Recognition

[Block diagram: feature extraction feeds a decoder, which uses acoustic models, a dictionary and a grammar.]
Video Mike (IBM, 2004)
Audio processing

■ Feature extraction
■ Digit detection
■ Digit recognition:
  • acoustic parameters: MFCC
  • context-independent HMMs
  • decoding: time-synchronous algorithm
■ Sound effects – noise: babble
■ Recognition experiments
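The MFCC front-end mentioned above can be sketched with numpy/scipy. This is a generic textbook pipeline (framing, Hamming window, mel filterbank, log, DCT), not the exact front-end used in these experiments; all frame sizes and filter counts are illustrative defaults.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160,
         n_filters=26, n_fft=512):
    """Minimal MFCC sketch: framing, windowing, power spectrum,
    triangular mel filterbank, log compression, DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # triangular filters centred on mel-spaced frequencies
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fbank[i - 1, bins[i - 1]:bins[i]] = \
            np.linspace(0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = \
            np.linspace(1, 0, bins[i + 1] - bins[i], endpoint=False)

    log_energies = np.log(np.maximum(power @ fbank.T, 1e-10))
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```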
Video processing

■ Video extraction
■ Lip localisation
■ Image interpolation (to the same frame rate as the speech features)
■ Feature extraction:
  • DCT and DCT2 (DCT + LDA)
  • projections: PRO and PRO2 (PRO + LDA)
■ Recognition experiments
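The DCT visual features above can be sketched as a 2-D DCT of a grayscale lip region, keeping a low-frequency block of coefficients. The 6×6 block size is an illustrative assumption, and the LDA step of DCT2 is omitted here.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(lip_roi, keep=6):
    """2-D DCT of a grayscale lip region; retain a low-frequency
    keep x keep block of coefficients as the visual feature vector."""
    coeffs = dct(dct(lip_roi, axis=0, norm='ortho'), axis=1, norm='ortho')
    return coeffs[:keep, :keep].flatten()
```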
Fusion techniques

■ Parameter fusion:
  • concatenation
  • dimensionality reduction: Linear Discriminant Analysis (LDA)
  • modelling: classical single-stream HMM
■ Score fusion: multi-stream HMM
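The two fusion levels above can be illustrated in a few lines of numpy. The 0.7 audio weight is an arbitrary illustrative value, not the one used in these experiments, and the truncation to the shorter stream is a simplification of the frame-rate interpolation described earlier.

```python
import numpy as np

def concat_fusion(audio_feats, video_feats):
    """Parameter-level fusion: frame-synchronous concatenation.
    Truncates to the shorter stream (assumes matched frame rates)."""
    n = min(len(audio_feats), len(video_feats))
    return np.hstack([audio_feats[:n], video_feats[:n]])

def stream_score_fusion(audio_loglik, video_loglik, audio_weight=0.7):
    """Score-level fusion as in a multi-stream HMM:
    weighted sum of per-stream log-likelihoods."""
    return audio_weight * audio_loglik + (1 - audio_weight) * video_loglik
```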
Experimental results: parameter fusion

[Figure: word accuracy (%, 0–100) versus S/N (−15 to +10 dB) for speech only, video only (PRO2, DCT2) and AV fusion (PRO2, DCT2).]
Experimental results: score fusion at −5 dB

[Figure: accuracy (roughly 42–52 %) for speech only and AV multi-stream fusion with PRO, PRO2, DCT and DCT2 features.]
Audiovisual identity verification

■ Fusion of face and speech for identity verification
■ Detection of possible forgeries
■ Compulsory? For:
  – homeland/corporate security: restricted access, …
  – secure computer login
  – secure on-line signing of contracts
Talking-face and 2D face sequence database

■ Data: video sequences (.avi) in which a short English phrase is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■ Audio-video data used for talking-face evaluations
■ The same sequences are used for 2D-face-from-video evaluations
■ 430 subjects each pronounced 4 phrases:
  – drawn from a set of 430 English phrases
  – 2 indoor video files acquired during the first session
  – 2 outdoor video files acquired during the second session
  – realistic forgeries created a posteriori
Audio-Visual Speech Features

■ Visual features: raw pixel values, DCT transform, shape-related features, many others …
■ Acoustic features: raw amplitude, "classical" MFCC coefficients, many others …
Audio-Visual Subspaces

■ Reduced audiovisual subspace: Principal Component and Linear Discriminant Analysis applied to the joint audio and visual features
■ Correlated audio and visual subspaces: Co-Inertia and Canonical Correlation Analysis
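Canonical Correlation Analysis, mentioned above, can be sketched with numpy alone: whiten each feature set, then take the SVD of the whitened cross-covariance. This is a generic two-view CCA, not the talk's Co-Inertia implementation, and the regularisation constant is an illustrative assumption.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """First pair of canonical directions and the first canonical
    correlation between two feature sets X, Y (frames x dims)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # whiten each space via Cholesky factors, then SVD the cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, s, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, 0], Wy @ Vt[0], s[0]
```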
Correspondence Measures

■ In the audiovisual subspace: Gaussian Mixture Models, Neural Networks, Coupled HMMs
■ In the correlated subspaces: correlation, mutual information
Application to indexing

■ High-level requests:
  – "Find videos where John Doe is speaking"
  – "Find dialogues between Mr X and Mrs Y"
  – "Locate the singer in this music video"

[Diagram: correlation between raw audio energy and raw pixel values.]
Who is speaking?

■ Face tracking
■ Correlation between the pixels of each face and the raw audio energy
■ Find the maximum synchrony (green: current speaker)
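The maximum-synchrony rule above can be sketched by correlating the audio energy with a per-face pixel-variation signal and picking the best-matching face. The function and signal names are mine; a real system would track mouth regions rather than arbitrary pixel traces.

```python
import numpy as np

def find_speaker(audio_energy, face_pixel_tracks):
    """Return the index of the face whose pixel-variation signal is most
    correlated with the audio energy, plus all correlation scores.
    audio_energy: (T,) per-frame energy; face_pixel_tracks: list of (T,)."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = [corr(audio_energy, f) for f in face_pixel_tracks]
    return int(np.argmax(scores)), scores
```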
How to perform "talking-face" authentication?

Face recognition and speaker verification scores are fused: both modalities say OK. But what if the access is a deliberate imposture?
Biometrics

■ Identity verification with talking faces:
  – speaker verification
  – face recognition
■ What if the face says OK and the voice says OK, but the access should still be rejected?
Identity Verification

A model is built for client λ at enrolment (Co-Inertia Analysis). A person ε pretending to be client λ is accepted if the verification score exceeds a threshold, and rejected otherwise. Equal Error Rate: 30 %.
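The Equal Error Rate quoted above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal numpy scorer (hypothetical helper, not the evaluation tool used in the talk):

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep thresholds over all observed scores and return the EER:
    the point where false acceptance equals false rejection."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors accepted
        frr = np.mean(client_scores < t)      # clients rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```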
Replay Attack Detection

A synchrony model is trained with Co-Inertia Analysis and Canonical Correlation Analysis; at test time, an access is accepted if its audio-visual synchrony score exceeds a threshold, and rejected otherwise.
Replay Attacks Detection
In a genuine synchronized video the lips match the audio; in an audio replay attack the lips do not match the audio perfectly.
Equal Error Rate: 14 %
Example of Replay attacks
[Figure: correlation versus audio/video delay (delayed video to delayed audio, −5 to +5 frames); the alignment is found by maximum correlation.]
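The alignment-by-maximum-correlation idea above can be sketched by scanning a range of frame lags and keeping the one that maximises the correlation between the two synchrony signals: a genuine recording should peak near lag 0, a dubbed replay elsewhere. The function name and lag range are illustrative assumptions.

```python
import numpy as np

def best_lag(audio_sync, video_sync, max_lag=5):
    """Return the frame offset in [-max_lag, max_lag] that maximises the
    correlation between audio and video synchrony signals."""
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    lags = list(range(-max_lag, max_lag + 1))
    scores = []
    for lag in lags:
        if lag >= 0:   # audio shifted forward relative to video
            a, v = audio_sync[lag:], video_sync[:len(video_sync) - lag]
        else:          # audio shifted backward relative to video
            a, v = audio_sync[:lag], video_sync[-lag:]
        scores.append(corr(a, v))
    return lags[int(np.argmax(scores))]
```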
Audiovisual identity verification
■ Available features extracted from the video:
  – face: facial features (lips, eyes) → face modality
  – speech → speech modality
  – speech synchrony → synchrony modality
Audiovisual identity verification
■ Face modality
  – Detection:
    • generative models (MPT toolbox)
    • temporal median filtering
    • eye detection within faces
  – Normalization: geometry + illumination
Audiovisual identity verification

■ Face modality:
  – two verification strategies within a single comparison framework
    • Global = eigenfaces:
      – compute a set of directions (eigenfaces) defining a projection space
      – two faces are compared through their projections onto the eigenface space
      – learning data: BIOMET (130 persons) + BANCA (30 persons)
Audiovisual identity verification
■ Face modality:
  • SIFT descriptors:
    – keypoint extraction
    – keypoint representation: a 128-dimensional descriptor (gradient orientation histograms, …) plus a 4-dimensional position vector (x, y, scale, orientation)
Audiovisual identity verification

■ Face modality:
  • SVD-based matching method, comparing two videos V1 and V2:
    – exclusive principle: one-to-one correspondences between faces (global) and between descriptors (local)
    – principle: compute a proximity matrix between faces or descriptors, then extract good pairings (made easy by SVD computation)
    – scores: one matching score between global representations and one between local representations
Variability!
Audiovisual identity verification
■ Speech modality:
  – GMM-based approach:
    • one world model
    • each speaker model is derived from the world model by MAP adaptation
    • the speech verification score is derived from a likelihood ratio
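The likelihood-ratio scoring above can be illustrated with a heavily simplified stand-in: a single diagonal Gaussian per model instead of a MAP-adapted GMM. All names are mine; this is a sketch of the scoring rule, not the system from the talk.

```python
import numpy as np

def diag_gauss_loglik(X, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian
    (one-component stand-in for a GMM)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var,
                         axis=1)

def verification_score(X, client_model, world_model):
    """Average log-likelihood ratio (client vs. world) over test frames;
    positive scores favour the claimed identity."""
    return float(np.mean(diag_gauss_loglik(X, *client_model)
                         - diag_gauss_loglik(X, *world_model)))
```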
Page 32 ATSIP, Sousse, May 18th, 2014
Audiovisual identity verification
■ Synchrony modality:
  – Principle: the synchrony between lips and speech carries identity information
  – Process:
    • compute a synchrony model (Co-Inertia Analysis) for each person, based on DCT (visual signal) and MFCC (speech signal) features
    • compare the test sample with the synchrony model
Page 33 ATSIP, Sousse, May 18th, 2014
Audiovisual identity verification

■ Experiments:
  – BANCA database:
    • 52 persons divided into two groups (G1 and G2)
    • 3 recording conditions
    • 8 recordings per person (4 client accesses, 4 impostor accesses)
    • evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
  – Scores:
    • 4 scores per access (PCA face, SIFT face, speech, synchrony)
    • score fusion based on an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely
Page 34 ATSIP, Sousse, May 18th, 2014
Audiovisual identity verification
■ Experiments: [results figure not reproduced]
SecurePhone
■ A technical solution that improves security
■ Biometric recognition making use of VOICE, FACE and SIGNATURE
■ An electronic signature is used to secure information exchange
Biometrics in SecurePhone

■ Operation: face, voice and written signature are each pre-processed and modelled; the three modality scores are fused, and access is granted or denied.
The BioSecure Multimodal Evaluation Campaign

■ Launched in April 2007
■ Many modalities, including 'Video sequences' and 'Talking Faces'
■ Development data and reference systems available
■ Evaluations on the sequestered BioSecure database (1000 clients)
■ Debriefing workshop
■ More info: http://www.int-evry.fr/biometrics/BMEC2007/index.php
Audio-visual forgery scenarios

■ Low-effort
  – "Paparazzi" scenario: the impostor owns a picture of the face and a recording of the voice of the target
  – "Big Brother" scenario: the impostor owns a video of the face and a recording of the voice of the target
■ High-effort
  – "Imitator" scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
  – "Playback" scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion
  – "Ventriloquist" scenario: combines the two previous ones
Detection of imposture
Face modality: ACCEPTED!
Voice modality: ACCEPTED!
Synchronisation: DENIED!
Talking-face forgeries @ BMEC: audio replay attack

■ Assumptions:
  – the forger has recorded speech data from the genuine user in outdoor (test) conditions
  – the forger replays the audio and uses his own face in front of the sensor

[Illustrations: stolen wave; audio replay + forger's face; audio replay + "random" face.]
Talking-face forgeries @ BMEC: face animation + TTS (Crazy Talk)

■ Assumptions:
  – the forger has stolen a picture
  – the forger uses face-animation software and TTS (male or female voice)
  – the forger plays back the animation to the sensor

[Illustrations: stolen picture; contour detection; generated .avi.]
Talking-face forgeries @ BMEC: picture presentation + TTS

■ Assumptions:
  – the forger has stolen a picture and printed it
  – the forger presents the printed picture to the sensor and uses TTS (the same wave as for the face-animation forgery)

[Illustrations: stolen picture; presented picture.]
Systems with fusion of (face, speech)

[Diagram: the video sequence is split into frames and a speech signal; face verification yields a face score, speaker verification a speech score, and the two are combined into a fusion score.]
Voice Conversion methods

■ GMM conversion
  – Training of a joint Gaussian model:
    • parallel corpus of aligned sentences from both the source and the target voice
    • MFCC on HNM (Harmonic plus Noise Model) parameterization
  – Speech synthesis from the Gaussian model:
    • inversion of the MFCC
    • pitch correction
■ ALISP conversion
  – A very-low-bit-rate (500 bps) speech compression method, originally developed by TELECOM-ParisTech
  – An indexed dictionary of segments of the target voice
  – HNM parameterization
Voice conversion techniques
Definition: the process of making one person's voice (the "source") sound like another person's voice (the "target").

[Illustration: "My name is John" spoken by the source is converted to sound as if spoken by the target.]
Principle of ALISP
[Coder: input speech undergoes spectral and prosodic analysis; segmental units are selected against a dictionary of representative segments, transmitting a segment index and prosodic parameters. Decoder: the same dictionary drives concatenative HNM synthesis of the output speech.]
Details of Encoding
[Encoder: speech undergoes spectral and prosodic analysis; HMM recognition against a dictionary of HMM models of ALISP classes yields the index of the ALISP class; among the representative units of that class (synthesis units A1 … A8), one is selected by DTW, giving the index of the synthesis unit; prosodic encoding transmits pitch, energy and duration.]
Details of decoding
[Decoder: the ALISP class index, the synthesis-unit index within the class and the prosodic parameters are used to load the synthesis unit; concatenative synthesis produces the output speech.]
Principle of Alisp conversion
Learning step (one hour of target voice):
  – parametric analysis: MFCC
  – segmentation based on temporal decomposition and vector quantization
  – stochastic modelling based on HMMs
  – creation of representative units

Conversion step:
  – parametric analysis: MFCC
  – HMM recognition
  – selection of the representative segment by DTW

Synthesis step:
  – concatenation of the representative units
  – HNM synthesis
Voice conversion using ALISP results
[Audio examples on the BREF and NIST databases: source, target and conversion result, for female and male speakers.]
Demonstration of Voice Conversion

[Audio examples: impostor voice; voice converted with GMM; voice converted with ALISP; voice converted with ALISP + GMM; target voice.]
3D reconstruction

• 3D face modelling from a front and a profile shot
• Animated face
• https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
Face Transformation

■ Control point selection (Figure 1)
■ Image segmentation (Figure 2: division of an image)
■ Linear transformation between the source and the target image
■ Blending step
Face Transformation

■ Localisation of control points in the source and target images
■ Warping: X' = f(X)
■ Blending: p = αp + (1 − α)p'
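The warping and blending steps above can be sketched in numpy. The affine form of f and the function names are illustrative assumptions; the actual transformation may be a more general piecewise warp over the segmented regions.

```python
import numpy as np

def affine_warp(points, A, b):
    """Warping step X' = f(X): a simple affine map applied to
    control points of shape (N, 2)."""
    return points @ A.T + b

def blend(source_pixels, warped_target_pixels, alpha=0.5):
    """Blending step p = alpha*p + (1 - alpha)*p' between the source
    pixels and the warped target pixels."""
    return alpha * source_pixels + (1.0 - alpha) * warped_target_pixels
```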
Face transformation (IBM)
Ouisper1 - Silent Speech Interface
■ A sensor-based system allowing speech communication via the standard articulators, but without glottal activity
■ Two distinct types of application:
  – an alternative to tracheo-oesophageal speech (TES) for persons who have undergone a tracheotomy
  – a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
■ Speech Synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
Ouisper - System Overview
[System overview: an audio-visual speech corpus is built from ultrasound video of the vocal tract, optical video of the speaker's lips, and recorded audio aligned with its text. Training: visual feature extraction and speech alignment. Test: a visual speech recognizer turns the visual data into an N-best list of phonetic or ALISP targets; visual unit selection and audio unit concatenation then produce the output speech.]
Ouisper - Training Data
Ouisper - Video Stream Coding
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, "Eigentongue Feature Extraction for an Ultrasound-based Silent Speech Interface," IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.

■ Build a subset of typical frames
■ Perform PCA
■ Code new frames by their projections onto the set of eigenvectors
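The EigenTongue coding described above amounts to PCA on vectorised ultrasound frames, followed by projection of new frames onto the leading eigenvectors. A numpy sketch (function names and the component count are mine):

```python
import numpy as np

def eigen_frames(frames, n_components=8):
    """PCA on vectorised frames: returns the mean frame and the top
    eigenvectors ('EigenTongues') as rows."""
    X = frames.reshape(len(frames), -1).astype(float)
    mean = X.mean(0)
    # right singular vectors of the centred data are the eigenvectors
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def encode(frame, mean, eigvecs):
    """Code a new frame by its projection coefficients onto the basis."""
    return eigvecs @ (frame.ravel().astype(float) - mean)
```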
Ouisper - Audio Stream Coding
■ ALISP segmentation:
  – detection of quasi-stationary parts in the parametric representation of speech
  – assignment of segments to classes using unsupervised classification techniques
■ Phonetic segmentation:
  – forced alignment of the speech with its text
  – requires a relevant and correct phonetic transcription of the uttered signal
■ Corpus-based synthesis:
  – requires a preliminary segmental description of the signal
Audiovisual dictionary building
■ Visual and acoustic data are synchronously recorded
■ The audio segmentation is used to bootstrap the visual speech recognizer
■ An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/), yielding the audiovisual dictionary
Visuo-acoustic decoding
■ Visual speech recognition:
  – train an HMM model for each visual class, using multistream-based learning techniques
  – perform a "visuo-phonetic" decoding step:
    • use the N-best list
    • introduce linguistic constraints (language model, dictionary, multigrams)
■ Corpus-based speech synthesis:
  – combine probabilistic and data-driven approaches in the audiovisual unit selection step
Speech recognition from video-only data
Ref: "Open your book to the first page"
     ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh
Rec: "A wear your book shoe the verse page"
     ax w ih y uh r b uh k sh uw dh ax v er s p ey jh

Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
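Recognition output such as the Ref/Rec pair above is typically scored by phone error rate: the Levenshtein distance between the phone strings, normalised by the reference length. A minimal scorer (a hypothetical helper, not the evaluation tool used in the study):

```python
import numpy as np

def phone_error_rate(ref, hyp):
    """Levenshtein distance between space-separated phone strings,
    normalised by the number of reference phones."""
    ref, hyp = ref.split(), hyp.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deletions
    d[0, :] = np.arange(len(hyp) + 1)   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(ref)
```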
Ouisper - Conclusion
■ More information on – http://www.neurones.espci.fr/ouisper/
■ Contacts – [email protected] – [email protected] – [email protected]
Audio-Visual Speech Processing Conclusions and Perspectives
■ A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone.
■ The combined use of facial and speech information improves identity verification and robustness to forgeries.
■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.