Cross-Language Speech Dependent Lip-Synchronization
Abhishek Jha 1, Vikram Voleti 2, Vinay Namboodiri 3, C. V. Jawahar 1
1 CVIT, KCIS, IIIT Hyderabad, India   2 Mila, Université de Montréal, Canada   3 Department of Computer Science, IIT Kanpur, India
Motivation
Goal: Given a video in a foreign language and dubbed speech in a regional language, synchronize the lips in the video to match the dubbed audio.
❖ Exponential growth in internet users in the last few years.
❖ A majority of these users are college/school-goers and young adults.
❖ A diverse demographic of students from all over the world enrolls in MOOCs.
Challenges: audio-to-lip-fiducial generation, cross-language lip synchronization, cross-identity lip synchronization.
Email: [email protected]@gmail.com
Project page: preon.iiit.ac.in/~abhishek_jha/lipsync_inst_vid
Datasets
● Hindi Speech corpus: 2.5 hours of audio-visual speech.
● GRID corpus: 33 speakers, 1000 phrases each (51-word vocabulary).
● Andrew Ng ML videos: 16,000 image frames extracted from deeplearning.ai MOOC videos.
● Telugu movies: scenes containing the protagonist's face, extracted from Telugu movies.
Contributions
❖ Developed a cross-language lip-synchronization model for dubbed-speech videos (e.g. Hindi dubbing of English videos).
❖ Developed an automated pipeline to curate a dataset for training the proposed model.
❖ Conducted a user-based study, which shows that learners prefer lip-sync for dubbed videos.
Approach
❖ Audio -> lip landmarks: train a time-delayed (TD) LSTM on the GRID corpus and the Hindi speech corpus (a minimal sketch follows this list).
❖ Intermediate processing: search for the best face frame to combine with the generated lip landmarks.
❖ Face + lip landmarks -> new face: train a U-Net on videos from the movie dataset.
❖ Inference:
➢ generate lip landmarks from the dubbed audio using the TD-LSTM;
➢ generate new frames by modifying the original Andrew Ng video frames with the U-Net.
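The Python (PyTorch) snippet below is a minimal, hypothetical sketch of the audio-to-landmark step: a 100-frame window of 12-D MFCC features is mapped to 40-D lip-landmark vectors by a time-delayed LSTM. The hidden size, delay length, and single-layer structure are illustrative assumptions, not the authors' exact configuration.

# Hypothetical TD-LSTM sketch: 12-D MFCC frames -> 40-D lip landmarks.
import torch
import torch.nn as nn

class TimeDelayedLSTM(nn.Module):
    def __init__(self, mfcc_dim=12, hidden_dim=128, landmark_dim=40, time_delay=20):
        super().__init__()
        self.time_delay = time_delay              # assumed delay, in feature frames
        self.lstm = nn.LSTM(mfcc_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, landmark_dim)

    def forward(self, mfcc):                      # mfcc: (batch, 100, 12)
        out, _ = self.lstm(mfcc)                  # (batch, 100, hidden_dim)
        # Read the hidden state a few steps late so each prediction can also
        # use short-term future audio context (the "time delay").
        delayed = out[:, self.time_delay:, :]
        return self.proj(delayed)                 # (batch, 100 - time_delay, 40)

landmarks = TimeDelayedLSTM()(torch.randn(2, 100, 12))  # -> (2, 80, 40)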
Cross-language results
[Figure: qualitative cross-language results on Telugu movie, Andrew Ng, GRID corpus, and Hindi speech clips.]
U-Net results
[Figure: recomputed ROI frames, masks, and corresponding U-Net outputs.]
Time-delayed LSTM and generation using U-Net
[Figure: 100 x 12-D MFCC features from the input video clip (frames t1 ... tn) are fed to the time-delayed LSTM, which outputs 40-D lip landmarks; landmark augmentation and lip generation with the U-Net then produce the output frames.]
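To make the generation stage concrete, here is a small, illustrative U-Net in PyTorch that takes a masked face frame stacked with a rendered lip-landmark channel (a 4-channel input layout assumed for this sketch) and outputs the re-synthesized face; the depth and channel counts are assumptions, not the authors' architecture.

# Illustrative U-Net sketch: masked frame + landmark channel -> generated face.
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=4, out_ch=3):
        super().__init__()
        self.enc1, self.enc2 = block(in_ch, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.mid = block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        self.out = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):                         # x: masked frame + landmark channel
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        m = self.mid(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.out(d1))        # generated face in [0, 1]

frame_and_landmarks = torch.randn(1, 4, 128, 128)
new_face = TinyUNet()(frame_and_landmarks)        # -> (1, 3, 128, 128)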
Intermediate processing
[Figure: all lip landmarks generated for the dubbed audio (40-D, ℝ40 lip-landmark representation) are clustered into 6 clusters; cluster assignment then drives frame selection.]
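A hedged sketch of this intermediate-processing step, assuming scikit-learn k-means: cluster the target video's 40-D lip-landmark vectors into 6 clusters and, for each landmark vector generated from the dubbed audio, select the closest source frame within its assigned cluster. The nearest-neighbour criterion and the function name select_frames are assumptions for illustration.

# Cluster source-frame lip landmarks and pick the best-matching frame per
# generated landmark vector (assumed selection rule).
import numpy as np
from sklearn.cluster import KMeans

def select_frames(video_landmarks, generated_landmarks, n_clusters=6):
    """video_landmarks: (num_frames, 40) array; generated_landmarks: (T, 40) array."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(video_landmarks)
    frame_clusters = kmeans.labels_                       # cluster of every source frame
    chosen = []
    for g in generated_landmarks:
        c = kmeans.predict(g[None])[0]                    # cluster assignment
        candidates = np.where(frame_clusters == c)[0]     # frames in the same cluster
        dists = np.linalg.norm(video_landmarks[candidates] - g, axis=1)
        chosen.append(int(candidates[dists.argmin()]))    # best-matching frame index
    return chosen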
Dataset curation - automated pipeline
[Figure: two curation branches. English videos pipeline (identity image, speech video): frame extraction, face detection, face recognition against the target person's image, landmark extraction, lip polygon, ROI masking, and merge, producing the U-Net input and ground truth. Hindi videos pipeline (dubbing video): voice activity detection, video split, face detection, landmark extraction, MFCC extraction, and normalization, producing the TD-LSTM inputs (40-D lip landmarks).]
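The snippet below is a rough sketch of one pass over a clip for the blocks listed above (face detection, lip-landmark extraction, 12-D MFCC extraction), assuming OpenCV, dlib, and librosa; the predictor model file, the 16 kHz sampling rate, and the helper name curate_clip are placeholders, not the authors' exact tooling.

# Sketch of per-clip curation: lip landmarks from video, 12-D MFCCs from audio.
import cv2
import dlib
import librosa

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def curate_clip(video_path, audio_path):
    # Visual stream: detect the face and extract the 20 lip points per frame.
    cap = cv2.VideoCapture(video_path)
    lip_tracks = []
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            shape = predictor(gray, faces[0])
            # dlib's 68-point model: indices 48-67 are the lip contour (20 points -> 40-D).
            lip_tracks.append([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
        ok, frame = cap.read()
    cap.release()

    # Audio stream: 12-D MFCC features for the TD-LSTM input.
    y, sr = librosa.load(audio_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12).T   # (num_windows, 12)
    return lip_tracks, mfcc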
Results: user-based study
(C: comfort level; US: un-synced; S: lip-synced; LS%: lip-sync percentage)

            C (US)    C (S)    LS% (US)    LS% (S)
Mean        2.51      3.1      23.86       45.95
Std. dev.   1.07      0.6      25.9        24.1
1. Audio -> lip landmarks
2. Video frame + new lip landmarks -> new video frame