SUPERVISED SPEAKER VERIFICATION: LARGE MARGIN FINE-TUNING & QUALITY-AWARE SCORE CALIBRATION
JENTHE THIENPONDT, BRECHT DESPLANQUES, KRIS DEMUYNCK
BASELINE SYSTEMS
ECAPA-TDNN [1]: Time Delay Neural Network enhanced by:
§ Squeeze-Excitation Modules
§ Res2Net Modules
§ Multi-layer Feature Aggregation
§ Channel-dependent Attentive Statistics Pooling
SE-ResNet34: ResNet34 [2] with Squeeze-Excitation Modules
[1] Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proc. Interspeech 2020.
[2] Garcia-Romero, D., Sell, G., & McCree, A. (2020). MagNetO: X-vector Magnitude Estimation Network plus Offset for Improved Speaker Recognition. Proc. Odyssey 2020: The Speaker and Language Recognition Workshop, 1-8.
BASELINE SYSTEMS
1. ECAPA-TDNN:
   1. Baseline model (C=2048)
   2. Tanh in the pooling layer
   3. Bidirectional LSTM module
   4. 256-dimensional embeddings
   5. Sub-center AAM & dynamic dilation
   6. 60-dimensional log mel-filterbanks
2. SE-ResNet34:
   1. Baseline model
   2. Channel-dependent attentive statistics pooling (CAS)
   3. Sub-center AAM
   4. Sub-center AAM & CAS
Open task: replace 1.4 with the baseline model trained on VoxCeleb1, VoxCeleb2, a LibriSpeech subset and a part of the DeepMine corpus
LARGE MARGIN FINE-TUNING: OVERVIEW
§ Margin-based loss functions (e.g. AAM-softmax, sketched below) are currently the most effective for classification and verification problems
§ Large margins increase inter-speaker distances and ensure intra-speaker compactness
§ However, training with high margins is difficult
§ This leads to lower and possibly sub-optimal margin values
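For reference, a minimal PyTorch sketch of an AAM-softmax (ArcFace-style) head; the margin and scale values are illustrative defaults, not the exact hyper-parameters of this submission.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive Angular Margin softmax head (illustrative sketch)."""
    def __init__(self, embed_dim, n_classes, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # The angular margin m is added to the target-class angle only
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```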
LARGE MARGIN FINE-TUNING
§ Large margin fine-tuning strategy (sketched after the reference below):
  § Start from a network converged with a lower margin and fine-tune with an increased margin
  § Increase the length of the training utterances
  § Cyclical learning rate (CLR) schedule with a lower maximum value and shorter cycles
  § Hard sample mining [1]
§ Keep all weights trainable
[1] Thienpondt, J., Desplanques, B., & Demuynck, K. (2020). Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization. Proc. Interspeech 2020.
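A hypothetical sketch of such a fine-tuning setup; the margin, crop length and learning-rate values are illustrative, not the published hyper-parameters, and the encoder is a trivial stand-in.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 192)            # stand-in for the converged speaker encoder
for p in model.parameters():
    p.requires_grad = True            # keep all weights trainable

margin = 0.5                          # increased AAM margin (e.g. up from 0.2)
crop_seconds = 6.0                    # fine-tune on longer utterance crops

opt = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-8, max_lr=1e-5,   # lower maximum learning rate ...
    step_size_up=30_000)              # ... and shorter cycles than pre-training
```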
QUALITY-AWARE SCORE CALIBRATION: OVERVIEW
§ Calibration: map the similarity output scores to log-likelihood ratios
§ Including quality measurements can make the mapping more robust against variability in recording quality, duration conditions, etc.
§ Proposed quality-aware score calibration mapping: the raw score is combined with a quality vector q through learnable weights w_q
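A minimal sketch of such a mapping, assuming a logistic-regression-style calibration in which the raw score is stacked with the quality vector q; the data and feature choices are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
s = rng.normal(size=(n, 1))                    # raw cosine similarity scores
q = rng.normal(size=(n, 2))                    # quality vector q (e.g. duration QMFs)
y = (s[:, 0] + 0.3 * q[:, 0] > 0).astype(int)  # synthetic target/non-target labels

# The learned coefficients play the role of the learnable weights w_q.
clf = LogisticRegression().fit(np.hstack([s, q]), y)
llr = clf.decision_function(np.hstack([s, q]))  # calibrated, LLR-like scores
```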
QUALITY-AWARE SCORE CALIBRATION
§ Calibrated on VoxCeleb2 trials with various duration characteristics
§ Quality measurements analyzed (the imposter mean is sketched below):
  § Duration-based
    § Input frames
    § Speech frames
  § Embedding-based
    § Embedding magnitude
  § Imposter-based
    § Imposter mean (cosine vs. inner product)
    § Imposter mean (top-100 vs. all)
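A minimal sketch of an imposter-mean quality measure, assuming a cohort of imposter embeddings scored against the trial embedding; the cohort size and top-k value are illustrative.

```python
import numpy as np

def imposter_mean(emb, cohort, k=100, cosine=True):
    """Mean similarity to the k closest imposters (illustrative sketch)."""
    if cosine:
        emb = emb / np.linalg.norm(emb)
        cohort = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    scores = cohort @ emb                # cosine or raw inner-product scores
    return np.sort(scores)[-k:].mean()   # top-k variant; use all scores otherwise

cohort = np.random.randn(5000, 192)      # imposter cohort embeddings
qmf = imposter_mean(np.random.randn(192), cohort)
```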
QUALITY-AWARE SCORE CALIBRATION
§ We want the quality measurements to be side-independent: swapping the enrollment and test side of a trial should not change them
§ Possibilities:
  § Arithmetic mean, product or minimum of both sides
  § => loss of information!
§ Symmetric Quality Measure Function (QMF), sketched below:
  § Consider the minimum and the maximum value as two separate features
  § Side-independent without loss of information
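A minimal sketch of such a symmetric QMF; the function name is illustrative.

```python
def symmetric_qmf(q_enroll: float, q_test: float) -> tuple[float, float]:
    # Swapping the enrollment and test sides yields the same feature pair,
    # and, unlike mean/product/minimum, no per-side information is lost.
    return min(q_enroll, q_test), max(q_enroll, q_test)

assert symmetric_qmf(2.0, 5.0) == symmetric_qmf(5.0, 2.0) == (2.0, 5.0)
```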
LARGE MARGIN FINE-TUNING: ANALYSIS

System                         VOXSRC-20 VAL EER (%)
No fine-tuning                 3.25
Large margin fine-tuning       2.89
  no margin increase           3.36
  no duration increase         3.58
  no CLR decrease              4.87
  no hard sampling             2.95
  frozen pre-pooling layers    3.12

§ Combining the margin increase, duration increase and CLR decrease is essential
§ Hard sampling provides a slight performance gain
§ Keeping all weights trainable is more effective than freezing the pre-pooling layers
QUALITY-AWARE SCORE CALIBRATION: ANALYSIS

System                                        VOXSRC-20 VAL EER (%)
ECAPA-TDNN                                    3.25
ECAPA-TDNN (fine-tuned)                       2.89
  + duration QMF                              2.68
  + speech duration QMF                       2.67
  + magnitude QMF                             2.87
  + imposter mean QMF                         2.81
  + speech duration QMF + imposter mean QMF   2.59

§ Duration-based measurements are the most effective
§ Speech duration provides only a marginal improvement over total duration
§ Magnitude is the weakest measurement, but still provides a benefit
§ The imposter mean based on the inner product can capture relevant utterance characteristics and is complementary to the speech-based measurements
FINAL FUSION SUBMISSION: RESULTS

System                        VOXSRC-20 TEST EER (%)   VOXSRC-20 TEST MinDCF
Fusion                        4.20                     0.2052
Fusion + fine-tuning          4.06                     0.1890
Fusion + fine-tuning + QMFs   3.73                     0.1772

§ All systems are fine-tuned using large margin fine-tuning
§ All system output scores are averaged after regular calibration
§ The speech duration and imposter mean QMFs are used for quality-aware score calibration
§ The imposter means are calculated separately for each system and subsequently averaged
UNSUPERVISED SPEAKER VERIFICATION: CONTRASTIVE LEARNING & ITERATIVE CLUSTERING
BRECHT DESPLANQUES, JENTHE THIENPONDT, KRIS DEMUYNCK
VOXCELEB2 TRAINING DATA
J. S. Chung, A. Nagrani, A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," Proc. Interspeech 2018.
DATASET        VOXCELEB2 - DEVELOPMENT
#UTTERANCES    1M+
#SPEAKERS      ?
#VIDEOS        ?

§ 1M+ training utterances, without any meta-data
§ Validation on the VoxCeleb1-based VoxSRC-20 validation set
THREE-STAGE APPROACH
1. Create a model that generates unique & consistent utterance embeddings
2. Iterative clustering to generate pseudo-labels
3. Supervised training on the pseudo-labels, robust against label noise
STAGE I: CONTRASTIVE LEARNING
P. Khosla, P. Teterwak, et al., "Supervised Contrastive Learning," arXiv preprint arXiv:2004.11362.
§ Speaker embeddings of the same utterance should be similar, but unique
§ The speaker embedding should be invariant to augmentations
§ Non-overlapping temporal crops, additive noise, reverb, …
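A minimal sketch of a contrastive (InfoNCE-style) objective over two augmented crops of the same utterances; the temperature is an illustrative value, not necessarily the exact loss configuration used here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """Two crops of the same utterance form the positive pair (diagonal)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(32, 192), torch.randn(32, 192))
```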
STAGE I: MOMENTUM CONTRAST
K. He, et al., "Momentum Contrast for Unsupervised Visual Representation Learning," in IEEE/CVF CVPR, 2020.
§ Limited number of negative comparison pairs for reasonable mini-batch sizes
§ Create a momentum-encoder copy that is not updated through backpropagation (see the sketch below)
§ Positive momentum samples become negatives in the next training iterations
§ Store all negative samples in a large buffer
§ Training takes about 12 hours on a single GPU
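A minimal sketch of the MoCo-style mechanics described above, assuming a trivial linear stand-in encoder; the queue size and momentum value are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(80, 192)             # stand-in query encoder
momentum_encoder = nn.Linear(80, 192)    # copy updated by EMA, not backprop
momentum_encoder.load_state_dict(encoder.state_dict())
for p in momentum_encoder.parameters():
    p.requires_grad = False

queue = torch.randn(65_536, 192)         # large buffer of negative samples

@torch.no_grad()
def momentum_update(m=0.999):
    # Exponential moving average replaces backpropagation for this encoder.
    for p_q, p_k in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

@torch.no_grad()
def enqueue(keys):
    # Today's positive momentum samples become tomorrow's negatives (FIFO).
    global queue
    queue = torch.cat([keys, queue])[: queue.size(0)]
```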
STAGE II: AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC)
§ High-quality embedding extraction on the full-length & clean training data
§ AHC is infeasible on 1M+ utterances due to memory complexity
§ Initial mini-batch K-means clustering (50K clusters)
§ AHC on the K-means cluster centers (7.5K clusters), as sketched below
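A minimal sketch of this two-stage clustering with scikit-learn; the sizes are scaled down so the sketch runs quickly (the actual pipeline uses 50K K-means clusters and 7.5K AHC clusters).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

embeddings = np.random.randn(10_000, 192)   # stand-in utterance embeddings

# Stage 1: mini-batch K-means makes the problem small enough for AHC.
kmeans = MiniBatchKMeans(n_clusters=500).fit(embeddings)
# Stage 2: AHC merges the K-means centers into the final clusters.
ahc = AgglomerativeClustering(n_clusters=75).fit(kmeans.cluster_centers_)

# Every utterance inherits the AHC label of its K-means center.
pseudo_labels = ahc.labels_[kmeans.labels_]
```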
STAGE II: ITERATIVE CLUSTERING & TRAINING ON PSEUDO-LABELS
§ The cluster IDs can be interpreted as pseudo-labels
§ Use a supervised loss & train with the pseudo-labels
§ Re-extract the training embeddings & re-cluster
§ Iterate the whole process (up to 7 times), as in the sketch below
§ Currently 4 hours per cycle on a single GPU
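A runnable toy sketch of the iterative loop; the "encoder" and "training step" are trivial stand-ins, and only the cluster, train, re-extract control flow mirrors the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_embeddings(model, data):
    return data @ model                        # stand-in encoder

def train_supervised(model, data, labels):
    return model + 0.01 * np.random.randn(*model.shape)  # stand-in update

data = np.random.randn(1000, 80)
model = np.random.randn(80, 192)
for _ in range(7):                             # up to 7 clustering cycles
    emb = extract_embeddings(model, data)
    pseudo_labels = KMeans(n_clusters=20, n_init=10).fit_predict(emb)
    model = train_supervised(model, data, pseudo_labels)
```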
STAGE III: NOISE-ROBUST TRAINING ON FINAL PSEUDO-LABELS
§ Use the final pseudo-labels to train a larger model
§ Label-noise robust technique: sub-center AAM (sketched after the reference below)
§ Do the noise-robust techniques work on this kind of label noise?
J. Deng, et al., "Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces," in Proc. ECCV, 2020.
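A minimal PyTorch sketch of a sub-center AAM head, assuming K sub-centers per class with max-pooling over the sub-center similarities so that noisy samples can gravitate to a non-dominant sub-center; hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterAAM(nn.Module):
    """AAM-softmax with K sub-centers per class (illustrative sketch)."""
    def __init__(self, embed_dim, n_classes, k=3, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes * k, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.k, self.margin, self.scale = k, margin, scale

    def forward(self, embeddings, labels):
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Keep only the closest sub-center per class: (batch, n_classes)
        cosine = cosine.view(-1, cosine.size(1) // self.k, self.k).amax(dim=2)
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```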
VOXSRC-20 VALIDATION RESULTS
[Figure: validation EER on VoxCeleb1 (original) vs. number of clusters, swept in steps of 2.5K; the EER improves from 7.3% to 2.1% over the iterations, with degradation when too few or too many clusters are used]
VOXSRC-20 CHALLENGE RESULTS & CONCLUSION
System                                                 VOXSRC-20 TEST EER (%)
Iterative clustering (7 iterations)                    7.7
Score averaging: iterative clustering & larger model   7.2
Reference supervised system (fusion)                   3.7

§ The combination of contrastive learning & iterative clustering delivers state-of-the-art results
§ Not having any manual speaker labels currently doubles the EER on VoxCeleb
§ What about performance on unsupervised datasets without any data cleaning or a reasonable speaker balance?