
Automatic Speech Recognition and Audio Indexing E.M. Bakker LIACS Media Lab



Page 1:

Automatic Speech Recognition
and Audio Indexing

E.M. Bakker

LIACS Media Lab

Page 2:

Topics

Live TV Captioning
Audio-based Surveillance Systems
Speaker Recognition
Music/Speech Segmentation

Page 3:

Live TV Captioning

TV captioning: transcribe and display the spoken parts of television programs
The BBC pioneered live captioning in 2001 and has captioned all of its broadcast programs since May 2008
Czech Television has provided live captions for Parliament broadcasts since November 2008

Page 4:

Live TV Captioning

Essential for people with hearing impairment
Helps non-native viewers with their language
Improves first-language literacy
Studies show:
- In the UK, 80% of viewers of TV with captions have no hearing impairment [4]
- In India, watching TV with captions improved literacy skills [3]

Page 5:

Live TV Captioning

Manual transcription of pre-recorded programs:
- A human captioner listens to and transcribes all speech in a program
- The transcription is edited, positioned and aligned
- 16 hours of work for 1 hour of program

Page 6:

Live TV Captioning

Captioning of pre-recorded programs using speech recognition:
- A human dictates the speech audio of the program
- The transcription is produced by speech recognition
- It is manually edited if necessary
- Significant reduction of effort

Page 7:

Live TV Captioning

Live captioning of TV programs:
- The challenge is to produce captions with a minimum delay
- The delay should be less than a few seconds
- Live captioning is essential for sports events, parliamentary broadcasts, live news, etc.
- Highly skilled stenographers (200 words/minute)
  - They need a break every 15 minutes
  - It is difficult to find enough people for large events (Olympic Games, World Championship Soccer, …)

Page 8:

Live TV Captioning

The BBC started live captioning using speech recognition in April 2001 [2]
Since May 2008, 100% of its broadcast programs are captioned
Typical scheme (see the sketch below):
- A captioner listens to the live program
- Re-speaks, rephrases and simplifies the spoken speech
- Enters punctuation and speaker-change information
- The speech recognition engine is adapted to the voice of the captioner
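
A minimal sketch of the re-speaking loop described above; asr_engine, audio_chunks and the helper below are hypothetical placeholders, not the BBC's actual system:

    def add_punctuation_and_speaker_tags(text: str) -> str:
        # Placeholder for the captioner's manually entered punctuation
        # and speaker-change markers.
        return text

    def respeaking_captions(audio_chunks, asr_engine):
        """The captioner re-speaks each chunk; a speaker-adapted engine transcribes it."""
        for chunk in audio_chunks:                # re-spoken, simplified audio
            text = asr_engine.transcribe(chunk)   # engine adapted to the captioner's voice
            yield add_punctuation_and_speaker_tags(text)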

Page 9:

Live TV Captioning

Czech Television, in cooperation with the University of West Bohemia in Pilsen, started live captioning of parliament meetings in November 2008

Pilot study:
- Speech recognition is done on the original speech signal, which is sent 5 seconds ahead
- The transcription is sent back immediately
- The system is trained on 100 hours of Czech Parliament broadcasts with their stenographic records

Near future:
- Re-speakers will be used for sports programs and other live events
- Note: Slavic languages pose the difficulty of a high degree of inflection and large numbers of prefixes and suffixes => a much larger vocabulary

Page 10:

Live TV Captioning

Application: automatic translation
- YouTube offers translations of the captions in 41 languages
- Translation modeling
- Robust handling of transcription errors made by the speech recognition engine
- More difficult for languages that are further apart: English – Japanese, Chinese
- Although with errors, it may still be beneficial: human interpretable

Page 11:

References

[1] F. Jurcicek, Speech Recognition for Live TV, IEEE Signal Processing Society – SLTC Newsletter, April 2009
[2] M.J. Evans, Speech Recognition in Assisted and Live Subtitling for Television, BBC R&D White Paper WHP 065, July 2003
[3] B. Kothari, A. Pandey, and A.R. Chudgar, Reading Out of the "Idiot Box": Same-Language Subtitling on Television in India, Information Technologies and International Development, Volume 2, Issue 1, September 2004
[4] Ofcom, Television Access Services – Summary, March 2006
[5] R. Griffiths, Giving Voice to Subtitling: Ruth Griffiths, Director of Access Services at BBC Broadcast, July 2005
[6] BBC, Press Releases – BBC Vision Celebrates 100% Subtitling, May 2008
[7] YouTube Blog, Auto Translate Now Available for Videos with Captions, November 2008

Page 12:

Fear-Type Emotion Recognition [1]

Fear-type emotion recognition for future audio-based surveillance systems
Fear-type emotions are expressed during abnormal, possibly life-threatening situations => public safety (the SERKET project, addressing multi-sensory data)
Specially developed corpus: SAFE

Page 13:

Fear-Type Emotion Recognition

The HUMAINE network of excellence evaluated emotional databases and noted the lack of corpora containing strong emotions with a good level of realism
(Juslin, Laukka, 2003): of 104 studies, 87% of the experiments were conducted on acted data
For strong emotions this percentage is almost 100%

Page 14:

Fear-Type Emotion Recognition

Acted databases
- Stereotypical
- Realism depends on acting skills
- Depends on the context given to the actor
- Recently (Bänziger and Pirker, 2006), (Enos and Hirschberg, 2006): more realistic emotions captured by applying certain acting techniques for more genuine emotions

Induced realistic emotions
- eWIZ database (Aubergé et al. 2003), SAL (Douglas-Cowie et al. 2003)
- Fear-type emotions: may be medically dangerous and/or unethical, therefore not available

Real-life databases
- Contain strong emotions
- Very typical: emergency call centers and therapy sessions
- Restricted scope of concepts

Page 15:

Annotation of Emotional Content

Scherer et al. (1980): push/pull opposite effects in emotional speech
- Physiological excitations push the voice in one direction
- Conscious, culturally driven effects pull it in another direction

Representing emotions in abstract dimensions
- Various dimensions have been proposed
- Whissell's (1989) activation/evaluation space is used most frequently because of the large range of emotional variation it covers (see the sketch after this list)

Emotional description categories
- Basic emotions, Ortony and Turner (1990)
- Primary emotions, Damasio (1994)
- Ekman and Friesen (1975), the big six (fear, anger, joy, disgust, sadness, surprise)
- And fuller, richer lists of 'basic' emotions

Challenges
- Real-life emotions are rarely basic emotions; they are much more often a complex blend
- This makes annotating real-life corpora a difficult task
- Diversity of data
- Not living up to industrial expectations
- EARL (Emotion Annotation and Representation Language) by the W3C
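
As a toy illustration of the dimensional representation, each emotion label can be placed in the activation/evaluation plane and points compared by distance (the coordinates below are invented for illustration, not taken from Whissell's published tables):

    import math

    # Illustrative activation/evaluation coordinates in [-1, 1]; invented values.
    emotion_space = {
        "panic":   (0.9, -0.8),   # high activation, very negative evaluation
        "worry":   (0.3, -0.5),
        "joy":     (0.6,  0.8),
        "relief":  (-0.2, 0.6),
        "neutral": (0.0,  0.0),
    }

    def nearest_emotion(activation: float, evaluation: float) -> str:
        """Map a point in the activation/evaluation plane to the closest label."""
        return min(emotion_space,
                   key=lambda e: math.dist(emotion_space[e], (activation, evaluation)))

    print(nearest_emotion(0.8, -0.7))  # -> "panic"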

Page 16:

Acoustic Features

Voice quality features, physiologically related and salient for fear characterization.

High-level features
- Pitch
- Intensity
- Speech rate
- Creaky, breathy, tense voice

Initially used for speech processing, but also for emotion recognition:

Low-level features
- Spectral features
- Cepstral features
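
A minimal sketch of extracting both feature levels with librosa (the file name and parameter values are placeholders; librosa is one common choice, not necessarily the tool used in the cited study):

    import librosa
    import numpy as np

    y, sr = librosa.load("clip.wav", sr=16000)           # placeholder file name

    # High-level features
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)        # pitch contour (Hz)
    intensity = librosa.feature.rms(y=y)[0]              # frame-wise energy

    # Low-level features
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # cepstral features
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # spectral feature

    # One fixed-length descriptor per clip: mean and std of each frame-wise feature
    features = np.concatenate([
        [f0.mean(), f0.std(), intensity.mean(), intensity.std(), centroid.mean()],
        mfcc.mean(axis=1), mfcc.std(axis=1),
    ])
    print(features.shape)                                # (31,)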

Page 17:

Classification Algorithms

Used in emotion recognition studies:
- Support Vector Machines
- Gaussian Mixture Models
- Hidden Markov Models
- K-Nearest Neighbors
- …

Evaluation of the methods is difficult:
- Diversity of data and context
- The different numbers and types of emotional classes
- Training and test conditions: speaker dependent or independent, etc.
- Acoustic feature extraction that depends on prior knowledge of linguistic content, speaker identity, normalization by speaker, phone, etc.
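
For instance, with scikit-learn the first and last of these classifiers can be compared on any fixed-length feature matrix (X and y below are random placeholders standing in for extracted features and fear/neutral labels):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder data: 200 clips x 31 features, binary fear/neutral labels
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 31))
    y = rng.integers(0, 2, size=200)

    for name, clf in [("SVM", SVC(kernel="rbf", C=1.0)),
                      ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
        model = make_pipeline(StandardScaler(), clf)
        scores = cross_val_score(model, X, y, cv=5)      # 5-fold cross-validation
        print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")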

Page 18:

Audio Surveillance

Audio clues work even if there are no visual clues: gunshots, human shouts, events out of the camera shot
Important emotional category: fear

Constraints
- High diversity of numbers and types of speakers
- Noisy environments: stadium, bank, airport, subway, station, etc.
- Speaker independent, with a high number of unknown speakers
- Text independent (i.e., no speech recognition; the quality of the recordings is of less importance)

Page 19:

Approach

- A new emotional database with a large scope of threat concepts, following the previously listed constraints
- A task-dependent annotation strategy
- Extraction of relevant acoustic features
- Classification robust to noise and variability of the data
- Performance evaluation

Page 20:

SAFE: Situation Analysis in a Fictional and Emotional Corpus

- 7 hours of recordings from fictional movies
- 400 audio-visual sequences (8 seconds – 5 minutes)
- Dynamics: normal and abnormal situations
- Variability: a high variety of emotional manifestations
- A large number of unknown speakers and unknown situations in noisy environments

Page 21:

SAFE: Situation Analysis in a Fictional and Emotional Corpus

Page 22:

Annotation

- Speaker track: gender and position (aggressor, victim, other)
- Threat track: degree of threat (no threat, potential, latent, immediate, past threat) plus intensity
- Speech track: verbal and non-verbal (shouts, breathing, etc.) content categories, audio environment (music/noise), quality of speech

Page 23:

Fear-Type Emotion Recognition

Emotional categories and subcategories:

Broad categories             Subcategories
Fear                         Stress, terror, anxiety, worry, anguish, panic, distress, mixed subcategories
Other negative emotions      Anger, sadness, disgust, suffering, deception, contempt, shame, despair, cruelty, mixed subcategories
Neutral – positive emotions  Joy, relief, determination, pride, hope, gratitude, surprise, mixed subcategories

Page 24:

Fear-Type Emotion Recognition

Page 25:

Features

- Pitch-related features (most useful for the fear vs. neutral classifier)
- Voice quality: jitter and shimmer
- Voiced classifier: spectral centroid
- Unvoiced classifier: spectral features, Bark band energy
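
Jitter and shimmer quantify cycle-to-cycle instability of the pitch periods and peak amplitudes of voiced speech. A numpy sketch of the standard "local" variants, assuming per-cycle periods and amplitudes have already been extracted by a pitch-mark detector (not shown):

    import numpy as np

    def local_jitter(periods: np.ndarray) -> float:
        """Mean absolute difference of consecutive pitch periods,
        relative to the mean period."""
        return np.abs(np.diff(periods)).mean() / periods.mean()

    def local_shimmer(amplitudes: np.ndarray) -> float:
        """Mean absolute difference of consecutive cycle peak amplitudes,
        relative to the mean amplitude."""
        return np.abs(np.diff(amplitudes)).mean() / amplitudes.mean()

    # Toy example: a slightly irregular 100 Hz voice (periods in seconds)
    periods = 0.010 + 0.0002 * np.random.default_rng(0).standard_normal(50)
    print(f"jitter = {local_jitter(periods):.3%}")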

Page 26:

Results

Page 27:

References

[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, T. Ehrette, Fear-Type Emotion Recognition for Future Audio-Based Surveillance Systems, Speech Communication, pp. 487-503, Vol. 50, 2008

Page 28:

Special Classes

A. Drahota, A. Costall, V. Reddy, The Vocal Communication of Different Kinds of Smiles, Speech Communication, pp. 278-287, Vol. 50, 2008

H.S. Cheang, M.D. Pell, The Sound of Sarcasm, Speech Communication, pp. 366-381, Vol. 50, 2008

Page 29:

Speaker Recognition/Verification

Text-independent speaker verification systems (Reynolds et al. 2000):
- Short-term spectra of speech signals are used to train speaker-dependent Gaussian Mixture Models (GMMs)
- A GMM-based background model is used to represent the distribution of impostors' speech
- Verification is based on the likelihood-ratio hypothesis test, where
  - the client GMM is the distribution under the null hypothesis
  - the background GMM is the distribution under the alternative hypothesis
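
A minimal sketch of this likelihood-ratio test with scikit-learn GMMs (the feature matrices are random placeholders; a real system would use MFCC features and MAP-adapt the client model from the background model rather than training it independently):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    ubm_feats = rng.normal(size=(5000, 13))                 # placeholder: pooled impostor features
    client_feats = rng.normal(0.5, 1.0, size=(500, 13))     # placeholder: enrollment features

    ubm = GaussianMixture(n_components=32, covariance_type="diag").fit(ubm_feats)
    client = GaussianMixture(n_components=32, covariance_type="diag").fit(client_feats)

    def verify(test_feats: np.ndarray, threshold: float = 0.0) -> bool:
        """Accept iff the average log-likelihood ratio exceeds the threshold."""
        llr = client.score(test_feats) - ubm.score(test_feats)  # score() = mean log-likelihood
        return bool(llr > threshold)

    print(verify(rng.normal(0.5, 1.0, size=(300, 13))))     # client-like data -> True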

Page 30:

Speaker Recognition/Verification

Training the background model is relatively straightforward using large speech corpora of non-target speakers

But: speaker verification based on high-level speaker features requires a large amount of speech for enrollment

Therefore: adaptation techniques are necessary
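
The classical remedy for sparse enrollment data is relevance MAP adaptation, which interpolates the background model toward the enrollment data. A sketch for the GMM means only (full MAP adaptation also updates weights and covariances):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, r: float = 16.0) -> np.ndarray:
        """Relevance-MAP adaptation of UBM means toward enrollment frames X."""
        gamma = ubm.predict_proba(X)                     # responsibilities, shape (T, M)
        n = gamma.sum(axis=0)                            # soft frame counts per mixture
        E = gamma.T @ X / np.maximum(n, 1e-10)[:, None]  # posterior-weighted data means
        alpha = (n / (n + r))[:, None]                   # data-dependent adaptation weight
        # Mixtures that saw little enrollment data stay close to the UBM mean
        return alpha * E + (1.0 - alpha) * ubm.means_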

Page 31:

Speaker Recognition/Verification

Text dependent & very short utterances:
- Phoneme-dependent HMMs

Text independent & moderate enrollment data (adaptation techniques):
- Maximum a Posteriori (MAP)
- Maximum Likelihood Linear Regression (MLLR)
- Low-level speaker models and speaker clustering
- Linear combination of reference models in an eigenvoice (EV) space
- EV adaptation => EVMLLR
- Non-linear adaptation by introducing kernel PCA

Kernel eigenspace MLLR outperforms the other adaptation models when the amount of enrollment data is extremely limited
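
The two classical updates differ in form; in standard notation (with $\mu_m$ a background mixture mean, $E_m(x)$ the posterior-weighted mean of the enrollment data, $n_m$ the soft frame count for mixture $m$, and $r$ a relevance factor):

    % Relevance MAP: interpolate each mixture mean toward the enrollment data
    \hat{\mu}_m = \alpha_m \, E_m(x) + (1 - \alpha_m)\, \mu_m,
    \qquad \alpha_m = \frac{n_m}{n_m + r}

    % MLLR: one shared affine transform, estimated by maximum likelihood,
    % moves all means at once (useful when n_m is tiny for most mixtures)
    \hat{\mu}_m = A \mu_m + b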

Page 32:

Speaker Recognition/Verification

Features, from high level to low level (Shriberg, 2007):
- Pronunciation (place of birth, education, socioeconomic status, etc.)
- Idiolect (education, socioeconomic status, etc.)
- Prosodic (personality type, parental influence, etc.)
- Acoustic (physical structure of the vocal organs)

Page 33:

Speaker Recognition/Verification

Pronunciation:
- Multilingual phone streams obtained by language-dependent phone ASR (N-grams, bin-tree)
- Multilingual phone cross-streams by language-dependent phone ASR (N-gram, CPM)
- Articulatory features by MLP and phone ASR (AFCPM)

Idiolect (education, socioeconomic status, etc.):
- Word streams by word ASR (N-gram, SVM)

Prosodic (personality type, parental influence, etc.):
- F0 and energy distribution by energy estimator (GMM)
- Pitch contour by F0 estimator and word ASR (DTW)
- F0 & energy contour & duration dynamics by F0 & energy estimator & phone ASR (N-gram)
- Prosodic statistics from F0 & duration by F0 and energy estimator & word ASR (KNN)

Acoustic (physical structure of the vocal organs):
- MFCCs & their time derivatives (GMM)

Page 34:

Training of Unadapted Phoneme-Dependent AFCPM Speaker Models

Page 35:

Adaptation Method A

Classical MAP, adapted from phoneme-dependent background models.

This is based on the classical MAP used in (Leung et al., 2006).
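
Since AFCPMs are discrete conditional probability tables, MAP adaptation here reduces to count smoothing. A sketch under that reading (the exact formulation in Leung et al., 2006 may differ in detail):

    import numpy as np

    def map_adapt_discrete(speaker_counts: np.ndarray,
                           background_prob: np.ndarray,
                           r: float = 16.0) -> np.ndarray:
        """MAP-smooth a speaker's articulatory-class counts (for one phoneme)
        toward the phoneme-dependent background distribution."""
        n = speaker_counts.sum()
        alpha = n / (n + r)                    # more enrollment data -> trust the counts more
        ml = speaker_counts / max(n, 1e-10)    # maximum-likelihood speaker estimate
        return alpha * ml + (1 - alpha) * background_prob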

Page 36:

Adaptation Method B

Phoneme-independent adaptation (PIA).

Adapted from phoneme-dependent speaker models and phoneme-independent speaker models.

Page 37:

Adaptation Method C

Scaled phoneme-independent adaptation (SPI).

Adapted from phoneme-independent speaker models with a phoneme-dependent scaling factor that depends on both the phoneme-dependent and phoneme-independent background models.

Page 38:

Adaptation Method D

Mixed phoneme-dependent and scaled phoneme-independent adaptation (MSPI).

Adapted from phoneme-dependent background models and phoneme-independent speaker models with a phoneme-dependent scaling factor that depends on both the phoneme-dependent and phoneme-independent background models.

This method is a combination of Methods A and C.

Page 39:

Adaptation Method E

Mixed phoneme-independent and scaled phoneme-dependent adaptation (MSPD).

Adapted from phoneme-independent speaker models and phoneme-dependent background models with a speaker-dependent scaling factor that depends on both the phoneme-independent speaker model and the background models.

This method is a combination of Methods B and C.

Page 40:

Adaptation Methods for AFCPMs

Page 41:

Method A

Page 42:

Method A Principals Relationships

Page 43:

Methods B - E

Page 44:

Method B Principals Relationships

Page 45:

Method C Principals Relationships

Page 46:

Method D Principals Relationships

Page 47:

Method E Principals Relationships

Page 48:

Used Databases

Page 49:

Results NIST00

Page 50:

Results NIST02

Page 51:

Results NIST00

Page 52:

Results NIST02

Page 53:

References

[1] L. Mary, B. Yegnanarayana, Extraction and Representation of Prosodic Features for Language and Speaker Recognition, Speech Communication, pp. 782-796, Vol. 50, 2008
[2] S.-X. Zhang, M.-W. Mak, A New Adaptation Approach to High-Level Speaker-Model Creation in Speaker Verification, Speech Communication, pp. 534-550, Vol. 51, 2009

Page 54:

Detecting Speech and Music Based on Spectral Tracking [1]
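
The cited approach tracks how spectral components evolve over time. As a loose toy illustration of the idea (my own heuristic, not the algorithm of Taniguchi et al.): music tends to hold stable spectral peaks (sustained notes), while the peaks of speech move quickly with formant transitions, so frame-to-frame peak stability alone already separates the two classes roughly:

    import numpy as np
    import librosa

    def peak_stability(y: np.ndarray) -> float:
        """Fraction of frames whose dominant spectral peak stays (nearly) put.
        Toy heuristic: higher values suggest music, lower values speech."""
        S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
        peaks = S.argmax(axis=0)                  # dominant frequency bin per frame
        return float(np.mean(np.abs(np.diff(peaks)) <= 1))

    y, sr = librosa.load("clip.wav", sr=16000)    # placeholder file name
    print("music-like" if peak_stability(y) > 0.5 else "speech-like")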

Page 55:

References

[1] T. Taniguchi, M. Tohyama, K. Shirai, Detection of Speech and Music Based on Spectral Tracking, Speech Communication, pp. 547-563, Vol. 50, 2008