SUPERVISED SPEAKER VERIFICATION: LARGE MARGIN FINE-TUNING & QUALITY-AWARE SCORE CALIBRATION
JENTHE THIENPONDT, BRECHT DESPLANQUES, KRIS DEMUYNCK
BASELINE SYSTEMS
ECAPA-TDNN [1]: Time Delay Neural Network enhanced by:
§ Squeeze-Excitation Modules
§ Res2Net Modules
§ Multi-layer Feature Aggregation
§ Channel-dependent Attentive Statistics Pooling
SE-ResNet34: ResNet34 [2] with Squeeze-Excitation Modules
[1] Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proc. Interspeech 2020.
[2] Garcia-Romero, D., Sell, G., & McCree, A. (2020). MagNetO: X-vector Magnitude Estimation Network plus Offset for Improved Speaker Recognition. Proc. Odyssey 2020: The Speaker and Language Recognition Workshop, 1-8.
BASELINE SYSTEMS
1. ECAPA-TDNN:
   1. Baseline model (C=2048)
   2. Tanh in the pooling layer
   3. Bidirectional LSTM module
   4. 256-dimensional embeddings
   5. Sub-center AAM & dynamic dilation
   6. 60-dimensional log mel-filterbanks
2. SE-ResNet34:
   1. Baseline model
   2. Channel-dependent attentive statistics pooling (CAS)
   3. Sub-center AAM
   4. Sub-center AAM & CAS
Open task: replace 1.4 with the baseline model trained on VoxCeleb1, VoxCeleb2, a LibriSpeech subset and a part of the DeepMine corpus
LARGE MARGIN FINE-TUNING: OVERVIEW
§ Margin-based loss functions (e.g. AAM-softmax, sketched below) are currently the most effective for classification and verification problems
§ Large margins increase inter-speaker distances and ensure intra-speaker compactness
§ However, training with high margins is difficult
§ This leads to lower and possibly sub-optimal margin values
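For reference, a minimal PyTorch sketch of an AAM-softmax (ArcFace-style) head; the margin and scale values are illustrative defaults, not the exact hyper-parameters of this submission.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive Angular Margin softmax head (illustrative sketch)."""
    def __init__(self, embed_dim, n_classes, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # The angular margin m is added to the target-class angle only
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```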
LARGE MARGIN FINE-TUNING
§ Large margin fine-tuning strategy (sketched after the reference below):
  § Start from a network converged with a lower margin and fine-tune with an increased margin
  § Increase the length of the training utterances
  § Cyclical learning rate (CLR) schedule with a lower maximum value and shorter cycles
  § Hard sample mining [1]
§ Keep all weights trainable
[1] Thienpondt, J., Desplanques, B., & Demuynck, K. (2020). Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization. Proc. Interspeech 2020.
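A hypothetical sketch of such a fine-tuning setup; the margin, crop length and learning-rate values are illustrative, not the published hyper-parameters, and the encoder is a trivial stand-in.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 192)            # stand-in for the converged speaker encoder
for p in model.parameters():
    p.requires_grad = True            # keep all weights trainable

margin = 0.5                          # increased AAM margin (e.g. up from 0.2)
crop_seconds = 6.0                    # fine-tune on longer utterance crops

opt = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-8, max_lr=1e-5,   # lower maximum learning rate ...
    step_size_up=30_000)              # ... and shorter cycles than pre-training
```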
QUALITY-AWARE SCORE CALIBRATION: OVERVIEW
§ Calibration: map the similarity output scores to log-likelihood ratios
§ Including quality measurements can make the mapping more robust against variability in recording quality, duration conditions, etc.
§ Proposed quality-aware score calibration mapping: the raw score is combined with a quality vector q through learnable weights w_q
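A minimal sketch of such a mapping, assuming a logistic-regression-style calibration in which the raw score is stacked with the quality vector q; the data and feature choices are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
s = rng.normal(size=(n, 1))                    # raw cosine similarity scores
q = rng.normal(size=(n, 2))                    # quality vector q (e.g. duration QMFs)
y = (s[:, 0] + 0.3 * q[:, 0] > 0).astype(int)  # synthetic target/non-target labels

# The learned coefficients play the role of the learnable weights w_q.
clf = LogisticRegression().fit(np.hstack([s, q]), y)
llr = clf.decision_function(np.hstack([s, q]))  # calibrated, LLR-like scores
```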
QUALITY-AWARE SCORE CALIBRATION
§ Calibrated on VoxCeleb2 trials with various duration characteristics
§ Quality measurements analyzed (the imposter mean is sketched below):
  § Duration-based
    § Input frames
    § Speech frames
  § Embedding-based
    § Embedding magnitude
  § Imposter-based
    § Imposter mean (cosine vs. inner product)
    § Imposter mean (top-100 vs. all)
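A minimal sketch of an imposter-mean quality measure, assuming a cohort of imposter embeddings scored against the trial embedding; the cohort size and top-k value are illustrative.

```python
import numpy as np

def imposter_mean(emb, cohort, k=100, cosine=True):
    """Mean similarity to the k closest imposters (illustrative sketch)."""
    if cosine:
        emb = emb / np.linalg.norm(emb)
        cohort = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    scores = cohort @ emb                # cosine or raw inner-product scores
    return np.sort(scores)[-k:].mean()   # top-k variant; use all scores otherwise

cohort = np.random.randn(5000, 192)      # imposter cohort embeddings
qmf = imposter_mean(np.random.randn(192), cohort)
```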
QUALITY-AWARE SCORE CALIBRATION
§ We want the quality measurements to be side-independent: swapping the enrollment and test side of a trial should not change them
§ Possibilities:
  § Arithmetic mean, product or minimum of both sides
  § => loss of information!
§ Symmetric Quality Measure Function (QMF), sketched below:
  § Consider the minimum and the maximum value as two separate features
  § Side-independent without loss of information
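A minimal sketch of such a symmetric QMF; the function name is illustrative.

```python
def symmetric_qmf(q_enroll: float, q_test: float) -> tuple[float, float]:
    # Swapping the enrollment and test sides yields the same feature pair,
    # and, unlike mean/product/minimum, no per-side information is lost.
    return min(q_enroll, q_test), max(q_enroll, q_test)

assert symmetric_qmf(2.0, 5.0) == symmetric_qmf(5.0, 2.0) == (2.0, 5.0)
```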
LARGE MARGIN FINE-TUNING: ANALYSIS

System                         VOXSRC-20 VAL EER (%)
No fine-tuning                 3.25
Large margin fine-tuning       2.89
  no margin increase           3.36
  no duration increase         3.58
  no CLR decrease              4.87
  no hard sampling             2.95
  frozen pre-pooling layers    3.12

§ Combining the margin increase, duration increase and CLR decrease is essential
§ Hard sampling provides a slight performance gain
§ Keeping all weights trainable is more effective than freezing the pre-pooling layers
QUALITY-AWARE SCORE CALIBRATION: ANALYSIS

System                                        VOXSRC-20 VAL EER (%)
ECAPA-TDNN                                    3.25
ECAPA-TDNN (fine-tuned)                       2.89
  + duration QMF                              2.68
  + speech duration QMF                       2.67
  + magnitude QMF                             2.87
  + imposter mean QMF                         2.81
  + speech duration QMF + imposter mean QMF   2.59

§ Duration-based measurements are the most effective
§ Speech duration provides only a marginal improvement over total duration
§ Magnitude is the weakest measurement, but still provides a benefit
§ The imposter mean based on the inner product can capture relevant utterance characteristics and is complementary to the speech-based measurements
FINAL FUSION SUBMISSION: RESULTS

System                        VOXSRC-20 TEST EER (%)   VOXSRC-20 TEST MinDCF
Fusion                        4.20                     0.2052
Fusion + fine-tuning          4.06                     0.1890
Fusion + fine-tuning + QMFs   3.73                     0.1772

§ All systems are fine-tuned using large margin fine-tuning
§ All system output scores are averaged after regular calibration
§ The speech duration and imposter mean QMFs are used for quality-aware score calibration
§ The imposter means are calculated separately for each system and subsequently averaged
UNSUPERVISED SPEAKER VERIFICATION: CONTRASTIVE LEARNING & ITERATIVE CLUSTERING
BRECHT DESPLANQUES, JENTHE THIENPONDT, KRIS DEMUYNCK
VOXCELEB2 TRAINING DATA
J. S. Chung, A. Nagrani, A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," Proc. Interspeech 2018.
DATASET        VOXCELEB2 - DEVELOPMENT
#UTTERANCES    1M+
#SPEAKERS      ?
#VIDEOS        ?

§ 1M+ training utterances, without any meta-data
§ Validation on the VoxCeleb1-based VoxSRC-20 validation set
THREE-STAGE APPROACH
1. Create a model that generates unique & consistent utterance embeddings
2. Iterative clustering to generate pseudo-labels
3. Supervised training on the pseudo-labels, robust against label noise
STAGE I: CONTRASTIVE LEARNING
P. Khosla, P. Teterwak, et al., "Supervised Contrastive Learning," arXiv preprint arXiv:2004.11362.
§ Speaker embeddings of the same utterance should be similar, but unique
§ The speaker embedding should be invariant to augmentations
§ Non-overlapping temporal crops, additive noise, reverb, …
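A minimal sketch of a contrastive (InfoNCE-style) objective over two augmented crops of the same utterances; the temperature is an illustrative value, not necessarily the exact loss configuration used here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """Two crops of the same utterance form the positive pair (diagonal)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(32, 192), torch.randn(32, 192))
```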
STAGE I: MOMENTUM CONTRAST
K. He, et al., "Momentum Contrast for Unsupervised Visual Representation Learning," in IEEE/CVF CVPR, 2020.
§ Limited number of negative comparison pairs for reasonable mini-batch sizes
§ Create a momentum-encoder copy that is not updated through backpropagation (see the sketch below)
§ Positive momentum samples become negatives in the next training iterations
§ Store all negative samples in a large buffer
§ Training takes about 12 hours on a single GPU
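A minimal sketch of the MoCo-style mechanics described above, assuming a trivial linear stand-in encoder; the queue size and momentum value are illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(80, 192)             # stand-in query encoder
momentum_encoder = nn.Linear(80, 192)    # copy updated by EMA, not backprop
momentum_encoder.load_state_dict(encoder.state_dict())
for p in momentum_encoder.parameters():
    p.requires_grad = False

queue = torch.randn(65_536, 192)         # large buffer of negative samples

@torch.no_grad()
def momentum_update(m=0.999):
    # Exponential moving average replaces backpropagation for this encoder.
    for p_q, p_k in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

@torch.no_grad()
def enqueue(keys):
    # Today's positive momentum samples become tomorrow's negatives (FIFO).
    global queue
    queue = torch.cat([keys, queue])[: queue.size(0)]
```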
STAGE II: AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC)
§ High-quality embedding extraction on the full-length & clean training data
§ AHC is infeasible on 1M+ utterances due to memory complexity
§ Initial mini-batch K-means clustering (50K clusters)
§ AHC on the K-means cluster centers (7.5K clusters), as sketched below
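A minimal sketch of this two-stage clustering with scikit-learn; the sizes are scaled down so the sketch runs quickly (the actual pipeline uses 50K K-means clusters and 7.5K AHC clusters).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

embeddings = np.random.randn(10_000, 192)   # stand-in utterance embeddings

# Stage 1: mini-batch K-means makes the problem small enough for AHC.
kmeans = MiniBatchKMeans(n_clusters=500).fit(embeddings)
# Stage 2: AHC merges the K-means centers into the final clusters.
ahc = AgglomerativeClustering(n_clusters=75).fit(kmeans.cluster_centers_)

# Every utterance inherits the AHC label of its K-means center.
pseudo_labels = ahc.labels_[kmeans.labels_]
```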
STAGE II: ITERATIVE CLUSTERING & TRAINING ON PSEUDO-LABELS
§ The cluster IDs can be interpreted as pseudo-labels
§ Use a supervised loss & train with the pseudo-labels
§ Re-extract the training embeddings & re-cluster
§ Iterate the whole process (up to 7 times), as in the sketch below
§ Currently 4 hours per cycle on a single GPU
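A runnable toy sketch of the iterative loop; the "encoder" and "training step" are trivial stand-ins, and only the cluster, train, re-extract control flow mirrors the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_embeddings(model, data):
    return data @ model                        # stand-in encoder

def train_supervised(model, data, labels):
    return model + 0.01 * np.random.randn(*model.shape)  # stand-in update

data = np.random.randn(1000, 80)
model = np.random.randn(80, 192)
for _ in range(7):                             # up to 7 clustering cycles
    emb = extract_embeddings(model, data)
    pseudo_labels = KMeans(n_clusters=20, n_init=10).fit_predict(emb)
    model = train_supervised(model, data, pseudo_labels)
```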
STAGE III: NOISE-ROBUST TRAINING ON FINAL PSEUDO-LABELS
§ Use the final pseudo-labels to train a larger model
§ Label-noise robust technique: sub-center AAM (sketched after the reference below)
§ Do the noise-robust techniques work on this kind of label noise?
J. Deng, et al., "Sub-center ArcFace: Boosting Face Recognition by Large-Scale Noisy Web Faces," in Proc. ECCV, 2020.
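A minimal PyTorch sketch of a sub-center AAM head, assuming K sub-centers per class with max-pooling over the sub-center similarities so that noisy samples can gravitate to a non-dominant sub-center; hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterAAM(nn.Module):
    """AAM-softmax with K sub-centers per class (illustrative sketch)."""
    def __init__(self, embed_dim, n_classes, k=3, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes * k, embed_dim))
        nn.init.xavier_normal_(self.weight)
        self.k, self.margin, self.scale = k, margin, scale

    def forward(self, embeddings, labels):
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        # Keep only the closest sub-center per class: (batch, n_classes)
        cosine = cosine.view(-1, cosine.size(1) // self.k, self.k).amax(dim=2)
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```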
VOXSRC-20 VALIDATION RESULTS
[Figure: validation EER on VoxCeleb1 (original) vs. number of clusters, swept in steps of 2.5K; the EER improves from 7.3% to 2.1% over the iterations, with degradation when too few or too many clusters are used]
VOXSRC-20 CHALLENGE RESULTS & CONCLUSION
System                                                 VOXSRC-20 TEST EER (%)
Iterative clustering (7 iterations)                    7.7
Score averaging: iterative clustering & larger model   7.2
Reference supervised system (fusion)                   3.7

§ The combination of contrastive learning & iterative clustering delivers state-of-the-art results
§ Not having any manual speaker labels currently doubles the EER on VoxCeleb
§ What about performance on unsupervised datasets without any data cleaning or a reasonable speaker balance?