
  • Speech Research Activities @ University of Eastern Finland

    Management Committee Meeting of COST Action IC1206, Zagreb, July 8-9, 2013

    Dr. Tomi H. Kinnunen

    School of Computing, UEF

    [email protected]


  • University of Eastern Finland

    • Three campuses: Joensuu, Kuopio, Savonlinna
    • Four faculties: science and forestry, health sciences, philosophical, business studies
    • 15,000 students, 2,800 staff members

    School of Computing

    • Annually 100 MSc + 10 PhD degrees
    • International master's program (IMPIT)
    • Multimedia computing is one of the three focus areas

  • Speech & Image Processing Unit (SIPU)

    Prof. Pasi Fränti (team leader)

    Dr. Tomi Kinnunen (leader of speech processing topics)

    Dr. Ville Hautamäki (postdoc)

    Dr. Padmanabhan Rajan (postdoc)

    Dr. Rahim Saeidi (postdoc @ Radboud Univ, the Netherlands)

    Dr. Cemal Hanilci (collaborator @ Uludag Univ, Turkey)

    + Several PhD students working on speech technology, data clustering and location-based systems (see the webpage for details)

    Core team

    http://www.uef.fi/en/sipu

    http://www.uef.fi/en/sipu/people

  • Research topics

    Speech processing: Gaussian mixture models, speaker recognition, feature extraction, voice activity detection, voice conversion

    Clustering methods: clustering algorithms, clustering validity, graph clustering

    Location-based applications: mobile data collection, route reduction and compression, photo collections and social networks, location-aware services & search engine

    Image processing & compression: lossless compression and data reduction, image denoising, ultrasonic, medical and HDR imaging

  • Speaker recognition activities

    • Odyssey 2014: The Speaker and Language Recognition Workshop in Joensuu

    • Theses: four PhD theses on speaker recognition: Tomi Kinnunen (2005), Ville Hautamäki (2008), Rahim Saeidi (2011), Evgeny Karpov (2011)

    • Funding: national TEKES-funded PUMS project (2003-2007) of Fränti, 3-year postdoc projects of Kinnunen (speaker recognition, 2010-2012) and Hautamäki (dialect and accent recognition, 2012-2014, ongoing), and a 4-year Academy of Finland project of Kinnunen, "Reliable Speaker Recognition and Modification" (2012-2015, ongoing)

    • Main collaborators: I2R and NTU (Singapore), Aalto University (Finland), Georgia Tech (USA), Aalborg University (Denmark), Lund University (Sweden), Uludag University (Turkey), CRIM (Canada), Speech Technology Center (Russia)

    • NIST SRE: participation in 2006, 2008, 2010, 2012

    • Selected publications:

      • V. Hautamäki, T. Kinnunen, F. Sedlak, K.A. Lee, B. Ma, H. Li, "Sparse Classifier Fusion for Speaker Verification", IEEE Trans. Audio, Speech and Language Processing, 21(8): 1622-1681, August 2013.

      • T. Kinnunen, R. Saeidi, F. Sedlak, K.A. Lee, J. Sandberg, M. Hansson-Sandsten, H. Li, "Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification", IEEE Trans. Audio, Speech and Language Processing, 20(7): 1990-2001, 2012.

      • C. Hanilci, T. Kinnunen, F. Ertas, R. Saeidi, J. Pohjalainen, P. Alku, "Regularized All-Pole Models for Speaker Verification Under Noisy Environments", IEEE Signal Processing Letters, 19(3): 163-166, March 2012.

      • T. Kinnunen and H. Li, "An Overview of Text-Independent Speaker Recognition: From Features to Supervectors", Speech Communication, 52(1): 12-40, January 2010.

      • T. Kinnunen, E. Karpov, P. Fränti, "Real-Time Speaker Identification and Verification", IEEE Trans. Audio, Speech and Language Processing, 14(1): 277-288, January 2006.

  • I4U submission to NIST SRE 2012

    1. ValidSoft Ltd, UK

    2. Swansea University, UK

    3. University of Avignon, France

    4. Radboud University Nijmegen, the Netherlands

    5. University of Texas at Dallas, USA

    6. University of Eastern Finland, Finland

    7. Institute for Infocomm Research, Singapore

    8. IDIAP Research Institute, Switzerland

    [ R. Saeidi et al., "I4U submission to NIST SRE 2012: a large-scale collaborative effort for noise-robust speaker verification", Interspeech 2013 (to appear) ]

  • International summer schools and seminars

    19th International Summer School in Novel Computing, June 18-21, 2012, UEF, Joensuu Campus
    • Recent Advances in Probabilistic Modeling for Pattern Recognition
      Dr. Patrick Kenny, Lead Researcher, CRIM, Montreal, Canada. Attendance: 43 participants
    • Social Computing
      Dr. Rosta Farzan, Postdoctoral Researcher, Human-Computer Interaction Institute, Carnegie Mellon University, USA. Attendance: 22 participants

    Winter Workshop on Data Mining and Pattern Recognition, March 4-6, 2013, UEF, Mekrijärvi research station, Ilomantsi

    Scientific Presentation Skills, June 10-12, 2013, UEF, Joensuu Campus
    • Dr. Jean-Luc Lebrun

    Scientific Writing Skills, August 6-8, 2012, UEF, Joensuu Campus
    • Dr. Jean-Luc Lebrun. Attendance: 96 participants

    16th International Summer School in Novel Computing, August 10-14, 2009, Joensuu
    • Speaker and Language Recognition
      Dr. Douglas A. Reynolds, Lincoln Laboratory, MIT. Attendance: 34 participants
    • Platforms for Stories-based Learning in Future Schools
      Prof. Paul De Bra, Eindhoven University of Technology, the Netherlands. Attendance: 18 participants

  • Activities relevant to COST Action IC1206

    1) Speaker recognition
    2) Voice conversion
    3) Spoofing and anti-spoofing for speaker recognition

  • Spoofing speaker recognizers

    • Human imitators (Lau et al., 2005; Farrus et al., 2008)
    • Playback attacks (Lindberg & Blomberg, Eurospeech 1999; Villalba & Lleida, FALA 2010)
    • Speaker-adapted speech synthesis (Pellom & Hansen, ICASSP 1999; Satoh et al., Eurospeech 2001; De Leon et al., Speaker Odyssey 2010)
    • Voice conversion (Jin et al., ICASSP 2008; Bonastre et al., Interspeech 2007; and many more)

    SPECIAL SESSION: "Spoofing and countermeasures for automatic speaker verification"
    Organizers: Nick Evans (EURECOM), Tomi Kinnunen (UEF), Junichi Yamagishi (Univ. Edinburgh), Sebastien Marcel (IDIAP)

  • Voice conversion

    Feature extraction:
    • Spectrum extraction using the SPTK toolkit, 30 mel-cepstral coefficients (MCEP)
    • F0 extraction using the RAPT algorithm

    Frame alignment: VQ codebook mapping (Sundermann et al., Interspeech 2004)

    Mapping function: joint-density GMM (Kain & Macon, ICASSP 1998); a sketch of this mapping follows below.
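
    The following is a minimal, illustrative implementation of the joint-density GMM conversion function of Kain & Macon (ICASSP 1998), assuming time-aligned source/target MCEP frames are already available; function names and model sizes are placeholders, not the exact UEF pipeline.

      # Minimal sketch of the joint-density GMM mapping (Kain & Macon, 1998).
      import numpy as np
      from scipy.stats import multivariate_normal
      from sklearn.mixture import GaussianMixture

      def train_jdgmm(src, tgt, n_components=8):
          """Fit a full-covariance GMM on stacked joint frames z_t = [x_t ; y_t]."""
          return GaussianMixture(n_components, covariance_type="full").fit(np.hstack([src, tgt]))

      def convert(gmm, src):
          """MMSE conversion: F(x) = sum_k p(k|x) [mu_y_k + Syx_k Sxx_k^{-1} (x - mu_x_k)]."""
          d = src.shape[1]
          mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
          Sxx, Syx = gmm.covariances_[:, :d, :d], gmm.covariances_[:, d:, :d]
          # Posterior p(k|x) from the marginal source model
          lik = np.stack([w * multivariate_normal(mu_x[k], Sxx[k]).pdf(src)
                          for k, w in enumerate(gmm.weights_)], axis=1)
          post = lik / lik.sum(axis=1, keepdims=True)
          # Posterior-weighted component-wise regressions
          out = np.zeros_like(src)
          for k in range(gmm.n_components):
              out += post[:, [k]] * (mu_y[k] + (src - mu_x[k]) @ np.linalg.solve(Sxx[k], Syx[k].T))
          return out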

  • Speaker modeling: selected approaches

    Frame-based approaches:
      1. Gaussian mixture model with universal background model (GMM-UBM) [Reynolds et al., 2000]
      2. Vector quantizer with universal background model (VQ-UBM) [Hautamäki et al., 2008]

    Utterance-based approaches:
      3. Generalized linear discriminant sequence support vector machine (GLDS-SVM) [Campbell et al., 2006a]
      4. GMM supervectors with support vector machine (GMM-SVM) [Campbell et al., 2006b]
      5. GMM with joint factor analysis (GMM-JFA) [Kenny et al., 2005, 2006, 2008]


  • Gaussian mixture model (GMM)

    The density of a feature vector x under a K-component GMM is

      p(x) = \sum_{k=1}^{K} P_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

    where P_k is the prior probability (mixing weight) of the kth component, \mu_k its mean vector, \Sigma_k its covariance matrix, and \mathcal{N}(\cdot \mid \mu_k, \Sigma_k) the multivariate Gaussian density.
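
    A direct NumPy evaluation of this density (diagonal covariances assumed for brevity; illustrative only):

      # Evaluate p(x) = sum_k P_k N(x | mu_k, diag(var_k)) for a single feature vector x.
      import numpy as np

      def gmm_density(x, weights, means, variances):
          d = x.shape[0]
          diff = x - means                                           # (K, d)
          exponent = -0.5 * np.sum(diff**2 / variances, axis=1)
          log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
          return np.sum(weights * np.exp(log_norm + exponent))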

  • Maximum a posteriori (MAP) adaptation of GMMs

    The adapted mean vector of the kth Gaussian interpolates between the mean of the speaker's training data and the prior mean from the universal background model (UBM):

      \hat{\mu}_k = \alpha_k E_k(x) + (1 - \alpha_k) \mu_k^{\mathrm{UBM}}

    where
      P(k \mid x_t)                              is the posterior probability of the kth Gaussian for feature vector x_t,
      n_k = \sum_t P(k \mid x_t)                 is the soft count of vectors assigned to the kth Gaussian,
      E_k(x) = (1/n_k) \sum_t P(k \mid x_t) x_t  is the mean of the training data assigned to the kth Gaussian,
      \alpha_k = n_k / (n_k + r)                 is the adaptation coefficient (r = relevance factor, usually r = 16).
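
    A minimal NumPy sketch of this relevance-MAP mean adaptation, assuming a diagonal-covariance UBM described by ubm_weights, ubm_means, ubm_vars and speaker training vectors X (T x d); variable names are illustrative, not taken from any UEF code.

      # Relevance-MAP adaptation of UBM means (Reynolds et al., 2000), diagonal covariances.
      import numpy as np

      def map_adapt_means(X, ubm_weights, ubm_means, ubm_vars, r=16.0):
          # Posterior P(k|x_t) for every frame and Gaussian: shape (T, K)
          diff = X[:, None, :] - ubm_means[None, :, :]
          log_lik = (-0.5 * np.sum(diff**2 / ubm_vars, axis=2)
                     - 0.5 * np.sum(np.log(2 * np.pi * ubm_vars), axis=1)
                     + np.log(ubm_weights))
          post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
          post /= post.sum(axis=1, keepdims=True)

          n_k = post.sum(axis=0)                                   # soft counts
          E_k = (post.T @ X) / np.maximum(n_k[:, None], 1e-10)     # per-Gaussian data means
          alpha = n_k / (n_k + r)                                  # adaptation coefficients
          return alpha[:, None] * E_k + (1.0 - alpha[:, None]) * ubm_means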


  • Vector quantization (VQ)

    One of the "classical" speaker recognition methods; similar performance to GMM with reduced computation.

    Speaker model = codebook C = {c_1, c_2, ..., c_K}, where the c_k are centroid vectors, usually obtained with K-means.

    In the VQ-UBM variant, each speaker centroid is adapted from the corresponding UBM centroid u_k in a MAP-like fashion:

      c_k = \alpha_k \bar{x}_k + (1 - \alpha_k) u_k,    \alpha_k = |S_k| / (|S_k| + r)

    where
      S_k = \{ x : \|x - u_k\| \le \|x - u_m\| \text{ for all } m \}  is the set of the speaker's training vectors mapped to UBM centroid u_k,
      \bar{x}_k = (1/|S_k|) \sum_{x \in S_k} x                        is the centroid of those vectors,
      r                                                               is the relevance factor (fixed constant).
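
    A short sketch of this centroid adaptation, assuming ubm_centroids is a (K x d) float UBM codebook and X holds the speaker's training vectors (T x d); illustrative only, not the exact recipe of Hautamäki et al. (2008).

      # MAP-like adaptation of a VQ codebook from the UBM codebook.
      import numpy as np

      def vq_map_adapt(X, ubm_centroids, r=16.0):
          # Hard-assign each training vector to its nearest UBM centroid
          dists = np.linalg.norm(X[:, None, :] - ubm_centroids[None, :, :], axis=2)
          nearest = dists.argmin(axis=1)

          adapted = ubm_centroids.copy()
          for k in range(ubm_centroids.shape[0]):
              S_k = X[nearest == k]
              if len(S_k) == 0:
                  continue                                  # no data mapped: keep the UBM centroid
              alpha = len(S_k) / (len(S_k) + r)
              adapted[k] = alpha * S_k.mean(axis=0) + (1.0 - alpha) * ubm_centroids[k]
          return adapted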


  • Sequence-kernel SVMs

  • Generalized linear discriminant sequence SVM (GLDS-SVM) [Campbell et al., 2006]

    Expand each feature vector with all monomials up to a chosen degree M and average the expansions over the utterance to represent it as a single supervector:

      b_{\mathrm{avg}} = (1/T) \sum_{t=1}^{T} b(x_t)

    Note: the expanded dimensionality is (d + M)! / (d! M!).

    Example: a 2nd-order expansion (M = 2) of a 2-dimensional input x = (x_1, x_2) gives
      b(x) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2], i.e. 4! / (2! 2!) = 6 dimensions.

    A linear-kernel SVM is then trained on these supervectors; a short sketch of the expansion follows below.
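
    A small sketch of the monomial expansion and utterance averaging, enumerating the monomials with itertools (names are illustrative):

      # GLDS-style supervector: average of monomial expansions of all frames.
      import numpy as np
      from itertools import combinations_with_replacement

      def glds_supervector(X, M=2):
          """X: utterance frames (T x d); returns a vector of length (d+M)!/(d! M!)."""
          T, d = X.shape
          expansions = []
          for x in X:
              feats = [1.0]                                 # degree-0 term
              for degree in range(1, M + 1):
                  for idx in combinations_with_replacement(range(d), degree):
                      feats.append(np.prod(x[list(idx)]))
              expansions.append(feats)
          return np.mean(np.array(expansions), axis=0)

      # Example: d = 2, M = 2 -> the 6 monomials [1, x1, x2, x1^2, x1*x2, x2^2]
      print(glds_supervector(np.random.randn(100, 2), M=2).shape)   # (6,)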


  • GMM supervectors [Campbell et al., 2006]

    Pipeline: speech utterance → feature extraction → MAP adaptation of the universal background model → stack the adapted mean vectors into a single supervector

      \mu = [\mu_1^T, \mu_2^T, \ldots, \mu_K^T]^T

    This GMM mean supervector has dimensionality K × d, where K = number of Gaussians and d = number of acoustic features.
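
    Forming the supervector itself is just a matter of stacking the K adapted mean vectors; a tiny illustrative sketch (the example sizes K = 512, d = 39 are placeholders):

      # Stack MAP-adapted component means (K x d) into one supervector of length K*d.
      import numpy as np

      def gmm_supervector(adapted_means):
          return adapted_means.reshape(-1)

      sv = gmm_supervector(np.zeros((512, 39)))
      print(sv.shape)   # (19968,)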


  • Joint factor analysis (JFA)

    Assume that each speaker- and channel-dependent supervector M is composed as a sum of two statistically independent components, a speaker supervector and a channel supervector:

      M = \underbrace{m + V y + D z}_{\text{speaker supervector}} + \underbrace{U x}_{\text{channel supervector}}

    where
      m     is the UBM supervector,
      V     is the eigenvoice matrix and y the speaker factors,
      D z   is the residual term,
      U     is the eigenchannel matrix and x the channel factors.

    [Kenny et al., 2005, 2007, 2008, http://www.crim.ca/perso/patrick.kenny/]

    V, D and U are model hyperparameters trained beforehand on large datasets; x, y and z need to be estimated for a given training sample.

    JFA cookbook from Brno University of Technology (BUT):
    http://speech.fit.vutbr.cz/software/joint-factor-analysis-matlab-demo
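
    The composition M = m + Vy + Dz + Ux can be illustrated numerically with made-up dimensions (a 200-dimensional supervector, 10 speaker factors, 5 channel factors); estimating x, y and z in practice requires the EM recipes of Kenny et al. (see the JFA cookbook linked above).

      # Numerical illustration of the JFA supervector decomposition (synthetic sizes).
      import numpy as np

      rng = np.random.default_rng(0)
      sv_dim, n_speaker_factors, n_channel_factors = 200, 10, 5

      m = rng.normal(size=sv_dim)                          # UBM supervector
      V = rng.normal(size=(sv_dim, n_speaker_factors))     # eigenvoice matrix
      U = rng.normal(size=(sv_dim, n_channel_factors))     # eigenchannel matrix
      D = np.diag(rng.uniform(0.1, 1.0, size=sv_dim))      # diagonal residual matrix

      y = rng.normal(size=n_speaker_factors)               # speaker factors
      z = rng.normal(size=sv_dim)                          # residual term
      x = rng.normal(size=n_channel_factors)               # channel factors

      speaker_part = m + V @ y + D @ z                     # speaker supervector
      channel_part = U @ x                                 # channel supervector
      M = speaker_part + channel_part                      # speaker- and channel-dependent supervector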

  • Spoofing results

    Classifier   Intersession compensation            Baseline EER (%)   EER on spoofed samples (%)
    GMM-UBM      -                                     7.63               24.99
    VQ-UBM       -                                     7.56               22.62
    GLDS-SVM     -                                     7.16               25.17
    GMM-SVM      Nuisance attribute projection (NAP)   3.74               12.58
    GMM-JFA      Joint factor analysis (JFA)           3.24                7.61

    T. Kinnunen, Z.-Z. Wu, K.A. Lee, F. Sedlak, E.S. Chng, H. Li, "Vulnerability of Speaker Verification Systems Against Voice Conversion Spoofing Attacks: the Case of Telephone Speech", Proc. ICASSP 2012.

  • Calibration breaks down even for the advanced classifiers

    False acceptance rates (%), with the threshold fixed at the EER point of the baseline:

                                          GMM-SVM   GMM-JFA
    Baseline (no spoofing)                   3.74      3.24
    Voice conversion spoofing (JD-GMM)      41.54     17.33
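
    The protocol behind these numbers can be sketched as follows: fix the decision threshold at the EER point of the baseline (genuine vs. zero-effort impostor) trials, then measure the false acceptance rate of spoofed trials at that same threshold. The score arrays below are placeholders, not UEF's actual system scores.

      # Fix the threshold at the baseline EER point, then measure FAR under spoofing.
      import numpy as np

      def eer_threshold(genuine, impostor):
          """Threshold where false acceptance and false rejection rates are (roughly) equal."""
          thresholds = np.sort(np.concatenate([genuine, impostor]))
          far = np.array([(impostor >= t).mean() for t in thresholds])
          frr = np.array([(genuine < t).mean() for t in thresholds])
          return thresholds[np.argmin(np.abs(far - frr))]

      def far_at(threshold, attack_scores):
          return (attack_scores >= threshold).mean()

      genuine = np.random.normal(2.0, 1.0, 1000)    # target trials (placeholder scores)
      impostor = np.random.normal(0.0, 1.0, 1000)   # zero-effort impostor trials
      spoofed = np.random.normal(1.5, 1.0, 1000)    # voice-conversion attack trials
      t = eer_threshold(genuine, impostor)
      print("FAR under spoofing: %.1f %%" % (100 * far_at(t, spoofed)))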

  • Spoofing i-vector systems

    Z. Wu, T. Kinnunen, E.S. Chng, H. Li, E. Ambikairajah, "A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case", APSIPA 2012, Hollywood, US, December 2012.

    False acceptance rates (%), with the threshold fixed at the EER point of the baseline:

                                   GMM-JFA   i-vector PLDA
    Baseline (no spoofing)            3.24            2.99
    JD-GMM conversion                17.36           19.29
    Unit selection conversion        32.54           41.25

  • Study with a human impersonator

    False acceptance rates (%), with the threshold fixed at the EER point of the baseline:

                               GMM-UBM   i-vector PLDA
    Baseline (no spoofing)       11.11            9.03
    Mimicry attack                9.68           11.61

    R. González Hautamäki, T. Kinnunen, V. Hautamäki, T. Leino, A.-M. Laukkanen, "I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry", Interspeech 2013 (to appear).

  • Acknowledgements

    • COST Action IC1206 and its members
    • Academy of Finland for partial funding
    • Zhizheng Wu, Eng Siong Chng, Haizhou Li (I2R and NTU, Singapore) for the joint studies on spoofing
    • Nick Evans, Junichi Yamagishi, Sebastien Marcel for the joint organization of the Interspeech 2013 special session
    • Other colleagues at SIPU who contributed to the studies and material presented here