Cross-Language Speech Dependent Lip-Synchronization
Abhishek Jha 1, Vikram Voleti 2, Vinay Namboodiri 3, C. V. Jawahar 1
1 CVIT, KCIS, IIIT Hyderabad, India   2 Mila, Université de Montréal, Canada   3 Department of Computer Science, IIT Kanpur, India
Motivation
Goal: Given a video in a foreign language and dubbed speech in a regional language, synchronize the lips in the video to match the dubbed audio.
❖ Exponential growth in internet users in the last few years.
❖ A majority of these users are college/school-goers and young adults.
❖ A diverse demographic of students from all over the world enrolls in MOOCs.
Challenges: audio-to-lip-fiducial generation, cross-language lip synchronization, cross-identity lip synchronization.
Email: [email protected]@gmail.com
Project page: preon.iiit.ac.in/~abhishek_jha/lipsync_inst_vid
Datasets
● Hindi Speech corpus: 2.5 hours of audio-visual speech.
● GRID corpus: 33 speakers, 1000 phrases each (51-word vocabulary).
● Andrew Ng ML videos: 16,000 image frames extracted from deeplearning.ai MOOC videos.
● Telugu movies: scenes containing the protagonist's face, extracted from Telugu movies.
Contributions
❖ Developed a cross-language lip-synchronization model for dubbed-speech videos (e.g. Hindi dubbing of English videos).
❖ Developed an automated pipeline to curate a dataset for training the proposed model.
❖ Conducted a user-based study, which shows that learners prefer lip-sync for dubbed videos.
Approach
❖ Audio -> lip landmarks: train a time-delayed (TD) LSTM on the GRID corpus and the Hindi speech corpus (a minimal sketch follows this list).
❖ Intermediate processing: search for the best face frame to combine with the generated lip landmarks.
❖ Face + lip landmarks -> new face: train a U-Net on videos from the movie dataset.
❖ Inference:
➢ generate lip landmarks from the dubbed audio using the TD-LSTM;
➢ generate new frames by modifying the original Andrew Ng video frames with the U-Net.
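The Python (PyTorch) snippet below is a minimal, hypothetical sketch of the audio-to-landmark step: a 100-frame window of 12-D MFCC features is mapped to 40-D lip-landmark vectors by a time-delayed LSTM. The hidden size, delay length, and single-layer structure are illustrative assumptions, not the authors' exact configuration.

# Hypothetical TD-LSTM sketch: 12-D MFCC frames -> 40-D lip landmarks.
import torch
import torch.nn as nn

class TimeDelayedLSTM(nn.Module):
    def __init__(self, mfcc_dim=12, hidden_dim=128, landmark_dim=40, time_delay=20):
        super().__init__()
        self.time_delay = time_delay              # assumed delay, in feature frames
        self.lstm = nn.LSTM(mfcc_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, landmark_dim)

    def forward(self, mfcc):                      # mfcc: (batch, 100, 12)
        out, _ = self.lstm(mfcc)                  # (batch, 100, hidden_dim)
        # Read the hidden state a few steps late so each prediction can also
        # use short-term future audio context (the "time delay").
        delayed = out[:, self.time_delay:, :]
        return self.proj(delayed)                 # (batch, 100 - time_delay, 40)

landmarks = TimeDelayedLSTM()(torch.randn(2, 100, 12))  # -> (2, 80, 40)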
Cross-language results
[Figure: qualitative cross-language results on Telugu movie, Andrew Ng, GRID corpus, and Hindi speech clips.]
U-Net results
[Figure: recomputed ROI frames, masks, and corresponding U-Net outputs.]
Time-delayed LSTM and generation using U-Net
[Figure: 100 x 12-D MFCC features from the input video clip (frames t1 ... tn) are fed to the time-delayed LSTM, which outputs 40-D lip landmarks; landmark augmentation and lip generation with the U-Net then produce the output frames.]
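To make the generation stage concrete, here is a small, illustrative U-Net in PyTorch that takes a masked face frame stacked with a rendered lip-landmark channel (a 4-channel input layout assumed for this sketch) and outputs the re-synthesized face; the depth and channel counts are assumptions, not the authors' architecture.

# Illustrative U-Net sketch: masked frame + landmark channel -> generated face.
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=4, out_ch=3):
        super().__init__()
        self.enc1, self.enc2 = block(in_ch, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.mid = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        self.out = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):                         # x: masked frame + landmark channel
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        m = self.mid(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.out(d1))        # generated face in [0, 1]

frame_and_landmarks = torch.randn(1, 4, 128, 128)
new_face = TinyUNet()(frame_and_landmarks)        # -> (1, 3, 128, 128)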
Intermediate processing
[Figure: all lip landmarks generated for the dubbed audio (40-D, ℝ40 lip-landmark representation) are clustered into 6 clusters; cluster assignment then drives frame selection.]
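A hedged sketch of this intermediate-processing step, assuming scikit-learn k-means: cluster the target video's 40-D lip-landmark vectors into 6 clusters and, for each landmark vector generated from the dubbed audio, select the closest source frame within its assigned cluster. The nearest-neighbour criterion and the function name select_frames are assumptions for illustration.

# Cluster source-frame lip landmarks and pick the best-matching frame per
# generated landmark vector (assumed selection rule).
import numpy as np
from sklearn.cluster import KMeans

def select_frames(video_landmarks, generated_landmarks, n_clusters=6):
    """video_landmarks: (num_frames, 40) array; generated_landmarks: (T, 40) array."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(video_landmarks)
    frame_clusters = kmeans.labels_                       # cluster of every source frame
    chosen = []
    for g in generated_landmarks:
        c = kmeans.predict(g[None])[0]                    # cluster assignment
        candidates = np.where(frame_clusters == c)[0]     # frames in the same cluster
        dists = np.linalg.norm(video_landmarks[candidates] - g, axis=1)
        chosen.append(int(candidates[dists.argmin()]))    # best-matching frame index
    return chosen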
Dataset curation - automated pipeline
[Figure: two curation branches. English videos pipeline (identity image, speech video): frame extraction, face detection, face recognition against the target person's image, landmark extraction, lip polygon, ROI masking, and merge, producing the U-Net input and ground truth. Hindi videos pipeline (dubbing video): voice activity detection, video split, face detection, landmark extraction, MFCC extraction, and normalization, producing the TD-LSTM inputs (40-D lip landmarks).]
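The snippet below is a rough sketch of one pass over a clip for the blocks listed above (face detection, lip-landmark extraction, 12-D MFCC extraction), assuming OpenCV, dlib, and librosa; the predictor model file, the 16 kHz sampling rate, and the helper name curate_clip are placeholders, not the authors' exact tooling.

# Sketch of per-clip curation: lip landmarks from video, 12-D MFCCs from audio.
import cv2
import dlib
import librosa

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def curate_clip(video_path, audio_path):
    # Visual stream: detect the face and extract the 20 lip points per frame.
    cap = cv2.VideoCapture(video_path)
    lip_tracks = []
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            shape = predictor(gray, faces[0])
            # dlib's 68-point model: indices 48-67 are the lip contour (20 points -> 40-D).
            lip_tracks.append([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
        ok, frame = cap.read()
    cap.release()

    # Audio stream: 12-D MFCC features for the TD-LSTM input.
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12).T   # (num_windows, 12)
    return lip_tracks, mfcc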
Results: user-based study
(C: comfort level; US: un-synced; S: lip-synced; LS%: lip-sync percentage)

            C (US)    C (S)    LS% (US)    LS% (S)
Mean        2.51      3.1      23.86       45.95
Std. dev.   1.07      0.6      25.9        24.1
1. Audio -> lip landmarks
2. Video frame + new lip landmarks -> new video frame