
Automatic Speech Recognition and Audio Indexing


Page 1: Automatic Speech Recognition  and  Audio Indexing

Automatic Speech Recognition and Audio Indexing

E.M. Bakker, LIACS Media Lab

Page 2: Automatic Speech Recognition  and  Audio Indexing

Topics
- Live TV Captioning
- Audio-based Surveillance Systems
- Speaker Recognition
- Music/Speech Segmentation

Page 3: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
- TV captioning: transcribe and display the spoken parts of television programs
- The BBC pioneered live captioning in 2001 and has captioned all of its broadcast programs since May 2008
- Czech Television has live-captioned Parliament broadcasts since November 2008

Page 4: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
- Essential for people with a hearing impairment
- Helps non-native viewers with their language
- Improves first-language literacy
- Studies show:
  - In the UK, 80% of the viewers of TV with captions have no hearing impairment [4]
  - In India, watching TV with captions improved literacy skills [3]

Page 5: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
Manual transcription of pre-recorded programs:
- A human captioner listens to and transcribes all speech in a program
- The transcription is edited, positioned, and aligned
- 16 hours of work for a 1-hour program

Page 6: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
Captioning of pre-recorded programs using speech recognition:
- A human dictates the speech audio of the program
- The transcription is produced by speech recognition
- Manually edited if necessary
- Significant reduction of effort

Page 7: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
Live captioning of TV programs:
- The challenge is to produce captions with a minimum delay
- The delay should be less than a few seconds
- Live captioning is essential for sports events, parliamentary broadcasts, live news, etc.
- Requires highly skilled stenographers (200 words/minute)
- They need a break every 15 minutes
- It is difficult to find enough people for large events (Olympic Games, World Championship Soccer, …)

Page 8: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
- The BBC started live captioning using speech recognition in April 2001 [2]
- Since May 2008, 100% of its broadcast programs are captioned [6]
- Typical scheme:
  - A captioner listens to the live program
  - Re-speaks, rephrases, and simplifies the spoken speech
  - Enters punctuation and speaker-change information
  - The speech recognition engine is adapted to the voice of the captioner

Page 9: Automatic Speech Recognition  and  Audio Indexing

Live TV CaptioningLive TV CaptioningCzech Television in cooperation with the University of Bohemia in Czech Television in cooperation with the University of Bohemia in

Pilsen started in November 2008 to do live captioning of parliament Pilsen started in November 2008 to do live captioning of parliament meetingsmeetings

Pilot study:Pilot study: The speech recognition is done on the original speech signal which The speech recognition is done on the original speech signal which

is sent 5 seconds ahead.is sent 5 seconds ahead. Transcription is sent back immediatelyTranscription is sent back immediately System is trained on 100 hours of Czech Parliament broadcast with System is trained on 100 hours of Czech Parliament broadcast with

their stenographic recordstheir stenographic recordsNear futureNear future Re-speakers will be used for sports programs, and other live eventsRe-speakers will be used for sports programs, and other live events Note: Slavic languages have the difficulty of having high degree of Note: Slavic languages have the difficulty of having high degree of

inflection, and large numbers of pre- and suf-fixes => much larger inflection, and large numbers of pre- and suf-fixes => much larger vocabularyvocabulary

Page 10: Automatic Speech Recognition  and  Audio Indexing

Live TV Captioning
Application: Automatic Translation
- YouTube offers translations of the captions into 41 languages [7]
- Translation modeling
- Robust handling of transcription errors made by the speech recognition engine
- More difficult for languages that are further apart: English – Japanese, Chinese
- Although there are errors, it still may be beneficial: human interpretable

Page 11: Automatic Speech Recognition  and  Audio Indexing

References
[1] F. Jurcicek, Speech Recognition for Live TV, IEEE Signal Processing Society – SLTC Newsletter, April 2009.
[2] M.J. Evans, Speech Recognition in Assisted and Live Subtitling for Television, BBC R&D White Paper WHP 065, July 2003.
[3] B. Kothari, A. Pandey, and A.R. Chudgar, Reading Out of the "Idiot Box": Same-Language Subtitling on Television in India, Information Technologies and International Development, Vol. 2, Issue 1, September 2004.
[4] Ofcom, Television Access Services - Summary, March 2006.
[5] R. Griffiths, Giving Voice to Subtitling, Director of Access Services at BBC Broadcast, July 2005.
[6] BBC, Press Release: BBC Vision Celebrates 100% Subtitling, May 2008.
[7] YouTube Blog, Auto Translate Now Available for Videos with Captions, November 2008.

Page 12: Automatic Speech Recognition  and  Audio Indexing

Fear-Type Emotion Recognition [1]
- Fear-type emotion recognition for future audio-based surveillance systems
- Fear-type emotions are expressed during abnormal, possibly life-threatening situations => public safety (SERKET project, addressing multi-sensory data)
- Specially developed corpus: SAFE

Page 13: Automatic Speech Recognition  and  Audio Indexing

Fear-Type Emotion Recognition
- The HUMAINE network of excellence evaluated emotional databases and noted the lack of corpora containing strong emotions with a good level of realism
- (Juslin and Laukka, 2003): of 104 studies, 87% of the experiments were conducted on acted data
- For strong emotions this percentage is almost 100%

Page 14: Automatic Speech Recognition  and  Audio Indexing

Fear-Type Emotion Recognition
Acted databases
- Stereotypical
- Realism depends on acting skills
- Depends on the context given to the actor
- Recently (Bänziger and Pirker, 2006; Enos and Hirschberg, 2006): more realistic emotions captured by applying certain acting techniques for more genuine emotions

Induced realistic emotions
- eWiz database (Aubergé et al. 2003), SAL (Douglas-Cowie et al. 2003)
- Fear-type emotions: inducing them may be medically dangerous and/or unethical; therefore not available

Real-life databases
- Contain strong emotions
- Very typical: emergency call centers and therapy sessions
- Restricted scope of concepts

Page 15: Automatic Speech Recognition  and  Audio Indexing

Annotation of Emotional Content
Scherer et al. (1980): push/pull opposite effects in emotional speech
- Physiological excitation pushes the voice in one direction
- Conscious, culturally driven effects pull it in another direction

Representing emotions in abstract dimensions
- Various dimensions have been proposed
- Whissell's (1989) activation/evaluation space is used most frequently because of the large range of emotional variation it covers

Emotional description categories
- Basic emotions, Ortony and Turner (1990)
- Primary emotions, Damasio (1994)
- Ekman and Friesen (1975): the big six (fear, anger, joy, disgust, sadness, surprise)
- And fuller, richer lists of 'basic' emotions

Challenges
- Real-life emotions are rarely basic emotions; they are much more often a complex blend
- This makes annotating real-life corpora a difficult task
- Diversity of data
- Not living up to industrial expectations
- EARL (Emotion Annotation and Representation Language) by the W3C

Page 16: Automatic Speech Recognition  and  Audio Indexing

Acoustic Features
Physiologically related and salient for fear characterization.

High-level features (including voice quality)
- Pitch
- Intensity
- Speech rate
- Voice quality: creaky, breathy, tense

Initially used for speech processing, but also for emotion recognition:

Low-level features
- Spectral features
- Cepstral features
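As a concrete illustration, here is a minimal Python sketch (using librosa) of how the pitch, intensity, and spectral features named above could be extracted from an audio clip; the filename and parameter values are placeholders, not from the original slides.

```python
# Minimal feature-extraction sketch with librosa; 'clip.wav' is a placeholder.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)

# High-level features: pitch track (pYIN) and intensity (RMS energy).
f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=600.0, sr=sr)
rms = librosa.feature.rms(y=y)[0]

# Low-level spectral feature: spectral centroid per frame.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]

print("mean F0 over voiced frames: %.1f Hz" % np.nanmean(f0))
print("mean RMS: %.4f, mean spectral centroid: %.1f Hz"
      % (rms.mean(), centroid.mean()))
```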

Page 17: Automatic Speech Recognition  and  Audio Indexing

Classification Algorithms
Used in emotion recognition studies:
- Support Vector Machines
- Gaussian Mixture Models
- Hidden Markov Models
- K-Nearest Neighbors
- …

Evaluation of the methods is difficult because of:
- The diversity of data and context
- The different numbers and types of emotional classes
- Training and test conditions: speaker dependent or independent, etc.
- Acoustic feature extraction that depends on prior knowledge of the linguistic content, speaker identity, normalization by speaker, phone, etc.
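To make the first option concrete, here is a minimal sketch of training and cross-validating an SVM fear-vs-neutral classifier with scikit-learn; the feature matrix and labels are random placeholders standing in for per-segment acoustic feature vectors.

```python
# Minimal SVM classification sketch; X and y are random placeholders that
# stand in for per-segment acoustic features and fear/neutral labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))      # placeholder feature vectors
y = rng.integers(0, 2, size=200)    # 1 = fear, 0 = neutral (placeholder)

# Feature scaling matters for SVMs; class_weight guards against imbalance.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
scores = cross_val_score(clf, X, y, cv=5)
print("mean CV accuracy: %.2f" % scores.mean())
```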

Page 18: Automatic Speech Recognition  and  Audio Indexing

Audio Surveillance
- Audio clues work even when there are no visual clues: gunshots, human shouts, events out of camera shot
- Important emotional category: Fear

Constraints
- High diversity in the number and type of speakers
- Noisy environments: stadium, bank, airport, subway, station, etc.
- Speaker independent, with a high number of unknown speakers
- Text independent (i.e., no speech recognition; the quality of the recordings is of less importance)

Page 19: Automatic Speech Recognition  and  Audio Indexing

Approach
- A new emotional database with a large scope of threat concepts, following the previously listed constraints
- A task-dependent annotation strategy
- Extraction of relevant acoustic features
- Classification robust to noise and to the variability of the data
- Performance evaluation

Page 20: Automatic Speech Recognition  and  Audio Indexing

SAFE: Situation Analysis in a Fictional and Emotional Corpus
- 7 hours of recordings from fictional movies
- 400 audio-visual sequences (8 seconds – 5 minutes)
- Dynamics: normal and abnormal situations
- Variability: a high variety of emotional manifestations
- A large number of unknown speakers and unknown situations in noisy environments

Page 21: Automatic Speech Recognition  and  Audio Indexing

SAFE: Situation Analysis in a Fictional and Emotional Corpus

Page 22: Automatic Speech Recognition  and  Audio Indexing

Annotation
- Speaker track: gender and position (aggressor, victim, other)
- Threat track: degree of threat (no threat, potential, latent, immediate, past threat) plus intensity
- Speech track: verbal and non-verbal (shouts, breathing, etc.) content categories, audio environment (music/noise), quality of speech
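As an illustration only, the three annotation tracks could be represented by a record like the following; the field names and value sets are a hypothetical schema, not the actual SAFE annotation format.

```python
# Hypothetical sketch of one annotated segment; field names and value sets
# are illustrative, not the actual SAFE annotation format.
from dataclasses import dataclass

@dataclass
class SpeakerTrack:
    gender: str        # "female" / "male"
    position: str      # "aggressor" / "victim" / "other"

@dataclass
class ThreatTrack:
    degree: str        # "none" / "potential" / "latent" / "immediate" / "past"
    intensity: int     # e.g. 1 (weak) .. 5 (strong)

@dataclass
class SpeechTrack:
    verbal: bool       # verbal vs. non-verbal (shouts, breathing, ...)
    environment: str   # "clean" / "music" / "noise"
    quality: str       # recording-quality label

@dataclass
class Segment:
    start_s: float
    end_s: float
    speaker: SpeakerTrack
    threat: ThreatTrack
    speech: SpeechTrack

seg = Segment(12.0, 17.5, SpeakerTrack("female", "victim"),
              ThreatTrack("immediate", 4), SpeechTrack(True, "noise", "ok"))
print(seg)
```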

Page 23: Automatic Speech Recognition  and  Audio Indexing

Fear-Type Emotion Recognition
Emotional categories and subcategories:

Broad category                  Subcategories
Fear                            Stress, terror, anxiety, worry, anguish, panic, distress, mixed subcategories
Other negative emotions         Anger, sadness, disgust, suffering, deception, contempt, shame, despair, cruelty, mixed subcategories
Neutral and positive emotions   Joy, relief, determination, pride, hope, gratitude, surprise, mixed subcategories

Page 24: Automatic Speech Recognition  and  Audio Indexing

Fear-Type Emotion Recognition

Page 25: Automatic Speech Recognition  and  Audio Indexing

Features
- Pitch-related features (most useful for the fear vs. neutral classifier)
- Voice quality: jitter and shimmer
- Voiced classifier: spectral centroid
- Unvoiced classifier: spectral features, Bark band energy
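For reference, jitter and shimmer have simple "local" definitions once per-cycle pitch periods and peak amplitudes are available; the following numpy sketch uses toy values and assumes those per-cycle estimates come from an upstream F0 tracker (tools such as Praat refine these measures further).

```python
# Local jitter/shimmer sketch on toy per-cycle estimates; in practice the
# periods and amplitudes would come from an F0 tracker, not hard-coded values.
import numpy as np

def local_jitter(periods: np.ndarray) -> float:
    """Mean absolute difference of consecutive pitch periods / mean period."""
    return float(np.mean(np.abs(np.diff(periods))) / np.mean(periods))

def local_shimmer(amplitudes: np.ndarray) -> float:
    """Mean absolute difference of consecutive peak amplitudes / mean amplitude."""
    return float(np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes))

periods = 1.0 / np.array([198.0, 201.5, 199.2, 202.3, 200.8])  # toy F0 values (Hz)
amps = np.array([0.81, 0.79, 0.84, 0.80, 0.82])                # toy peak amplitudes
print("jitter: %.4f, shimmer: %.4f" % (local_jitter(periods), local_shimmer(amps)))
```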

Page 26: Automatic Speech Recognition  and  Audio Indexing

Results

Page 27: Automatic Speech Recognition  and  Audio Indexing

References

[1] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, T. Ehrette, Fear-Type Emotion Recognition for Future Audio-Based Surveillance Systems, Speech Communication, pp. 487-503, Vol. 50, 2008.

Page 28: Automatic Speech Recognition  and  Audio Indexing

Special Classes
A. Drahota, A. Costall, V. Reddy, The Vocal Communication of Different Kinds of Smiles, Speech Communication, pp. 278-287, Vol. 50, 2008.

H. S. Cheang, M. D. Pell, The Sound of Sarcasm, Speech Communication, pp. 366-381, Vol. 50, 2008.

Page 29: Automatic Speech Recognition  and  Audio Indexing

Speaker Recognition/Verification
Text-independent speaker verification systems (Reynolds et al. 2000)
- Short-term spectra of speech signals are used to train speaker-dependent Gaussian Mixture Models (GMMs)
- A GMM-based background model is used to represent the distribution of impostors' speech
- Verification is based on the likelihood-ratio hypothesis test, where:
  - the client GMM is the distribution under the null hypothesis
  - the background GMM is the distribution under the alternative hypothesis
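A minimal sketch of this test with scikit-learn GMMs follows; the feature arrays are random placeholders for MFCC-like frames, and a real system would typically MAP-adapt the client model from the background model rather than train it independently.

```python
# GMM likelihood-ratio verification sketch; feature arrays are placeholders
# standing in for short-term spectral frames (e.g. MFCCs).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
client_frames = rng.normal(0.5, 1.0, size=(2000, 13))
background_frames = rng.normal(0.0, 1.0, size=(8000, 13))
test_frames = rng.normal(0.5, 1.0, size=(300, 13))

client_gmm = GaussianMixture(n_components=8, covariance_type="diag",
                             random_state=0).fit(client_frames)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background_frames)

# Average per-frame log-likelihood ratio; accept the identity claim if it
# exceeds a threshold tuned on development data (0.0 is a placeholder).
llr = client_gmm.score(test_frames) - ubm.score(test_frames)
print("LLR = %.3f, accept = %s" % (llr, llr > 0.0))
```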

Page 30: Automatic Speech Recognition  and  Audio Indexing

Speaker Recognition/Verification
- Training the background model is relatively straightforward using large speech corpora of non-target speakers
- But: speaker verification based on high-level speaker features requires a large amount of speech for enrollment
- Therefore: adaptation techniques are necessary

Page 31: Automatic Speech Recognition  and  Audio Indexing

Speaker Recognition/Verification
Text dependent & very short utterances:
- Phoneme-dependent HMMs

Text independent & moderate enrollment data (adaptation techniques):
- Maximum a Posteriori (MAP)
- Maximum Likelihood Linear Regression (MLLR)
- Low-level speaker models and speaker clustering
- Linear combination of reference models in an eigenvoice (EV) space
- EV adaptation => EVMLLR
- Non-linear adaptation by introducing kernel PCA

Kernel eigenspace MLLR outperforms the other adaptation methods when the amount of enrollment data is extremely limited.
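For orientation, the two adaptation schemes named above have the following standard forms (added here for reference; the slides themselves do not give formulas): MLLR applies a shared affine transform to the Gaussian means, while eigenvoice adaptation constrains the adapted speaker supervector to the span of reference supervectors.

```latex
% Standard forms of the adaptation schemes named above (reference only).
\begin{align}
  \hat{\mu}_k &= A\,\mu_k + b
      && \text{(MLLR: shared affine transform of each mean } \mu_k\text{)}\\
  \hat{s} &= \sum_{i=1}^{K} w_i\, e_i
      && \text{(eigenvoice: weights } w_i \text{ over basis supervectors } e_i\text{)}
\end{align}
```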

Page 32: Automatic Speech Recognition  and  Audio Indexing

Speaker Recognition/Verification
Features, from high level to low level (Shriberg, 2007):
- Pronunciations (place of birth, education, socioeconomic status, etc.)
- Idiolect (education, socioeconomic status, etc.)
- Prosodic (personality type, parental influence, etc.)
- Acoustic (physical structure of the vocal organs)

Page 33: Automatic Speech Recognition  and  Audio Indexing

Speaker Recognition/VerificationSpeaker Recognition/VerificationPronunciations: Pronunciations: Multilingual phone streams obtained by language dependent phone Multilingual phone streams obtained by language dependent phone

ASR (N-grams, Bin-tree)ASR (N-grams, Bin-tree) Multilingual phone cross-streams by language dependent phone Multilingual phone cross-streams by language dependent phone

ASR (N-gram, CPM)ASR (N-gram, CPM) Articulatory features by MLP and phone ASR (AFCPM)Articulatory features by MLP and phone ASR (AFCPM)Idiolect (Education, socioecconomic status, etc.)Idiolect (Education, socioecconomic status, etc.) Word streams by word ASR (N-gram, SVM)Word streams by word ASR (N-gram, SVM)Prosodic (Personality type, parental influence, etc.)Prosodic (Personality type, parental influence, etc.) F0 and Energy distribution by Energy estimator (GMM)F0 and Energy distribution by Energy estimator (GMM) Pitch contour by F0 estimator and word ASR (DTW)Pitch contour by F0 estimator and word ASR (DTW) F0 & Energy contour & duration dynamics by F0 & energy estimator F0 & Energy contour & duration dynamics by F0 & energy estimator

& phone ASR (N-gram)& phone ASR (N-gram) Prosodic statistics from F0 & duration by F0 and energy estimator & Prosodic statistics from F0 & duration by F0 and energy estimator &

word ASR (KNN)word ASR (KNN)Acoustic (Physical structure of vocal organs)Acoustic (Physical structure of vocal organs) MFCC & its time derivatives (GMM)MFCC & its time derivatives (GMM)
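The last item, MFCCs with their time derivatives, is the classic acoustic front end for GMM speaker models; a minimal librosa sketch is shown below ('speech.wav' is a placeholder filename).

```python
# MFCC + delta + delta-delta front end sketch; 'speech.wav' is a placeholder.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
d1 = librosa.feature.delta(mfcc)                    # first time derivative
d2 = librosa.feature.delta(mfcc, order=2)           # second time derivative

# Per-frame 39-dimensional vectors, as typically fed to a speaker GMM.
feats = np.vstack([mfcc, d1, d2]).T
print(feats.shape)
```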

Page 34: Automatic Speech Recognition  and  Audio Indexing

Training of Unadapted Phoneme-Dependent AFCPM Speaker Models

Page 35: Automatic Speech Recognition  and  Audio Indexing

Adaptation Method A
Classical MAP, adapted from phoneme-dependent background models.
This is based on the classical MAP used in (Leung et al., 2006).
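In its generic discrete form, classical MAP adaptation interpolates the speaker's empirical distribution with the background model, with a weight that grows with the amount of enrollment data. The template below is supplied for orientation only; the exact formulation in Leung et al. (2006) and in Methods B-E may differ in detail.

```latex
% Generic discrete MAP adaptation template (for orientation only):
% \tilde{P}_s is the speaker's empirical distribution of articulatory
% feature class a for phoneme p, P_b the background model, n the number
% of enrollment frames, and r a fixed relevance factor.
\begin{equation}
  \hat{P}_s(a \mid p) = \beta\,\tilde{P}_s(a \mid p) + (1-\beta)\,P_b(a \mid p),
  \qquad \beta = \frac{n}{n + r}
\end{equation}
```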

Page 36: Automatic Speech Recognition  and  Audio Indexing

Adaptation Method B
Phoneme-independent adaptation (PIA).
Adapted from phoneme-dependent speaker models and phoneme-independent speaker models.

Page 37: Automatic Speech Recognition  and  Audio Indexing

Adaptation Method C
Scaled phoneme-independent adaptation (SPI).
Adapted from phoneme-independent speaker models with a phoneme-dependent scaling factor that depends on both the phoneme-dependent and phoneme-independent background models.

Page 38: Automatic Speech Recognition  and  Audio Indexing

Adaptation Method D
Mixed phoneme-dependent and scaled phoneme-independent adaptation (MSPI).
Adapted from phoneme-dependent background models and phoneme-independent speaker models with a phoneme-dependent scaling factor that depends on both the phoneme-dependent and phoneme-independent background models.
This method is a combination of Methods A and C.

Page 39: Automatic Speech Recognition  and  Audio Indexing

Adaptation Method E
Mixed phoneme-independent and scaled phoneme-dependent adaptation (MSPD).
Adapted from phoneme-independent speaker models and phoneme-dependent background models with a speaker-dependent scaling factor that depends on both the phoneme-independent speaker model and the background models.
This method is a combination of Methods B and C.

Page 40: Automatic Speech Recognition  and  Audio Indexing

Adaptation Methods for AFCPMs

Page 41: Automatic Speech Recognition  and  Audio Indexing

Method A

Page 42: Automatic Speech Recognition  and  Audio Indexing

Method A Principal Relationships

Page 43: Automatic Speech Recognition  and  Audio Indexing

Methods B - E

Page 44: Automatic Speech Recognition  and  Audio Indexing

Method B Principal Relationships

Page 45: Automatic Speech Recognition  and  Audio Indexing

Method C Principal Relationships

Page 46: Automatic Speech Recognition  and  Audio Indexing

Method D Principal Relationships

Page 47: Automatic Speech Recognition  and  Audio Indexing

Method E Principal Relationships

Page 48: Automatic Speech Recognition  and  Audio Indexing

Used Databases

Page 49: Automatic Speech Recognition  and  Audio Indexing

Results NIST00

Page 50: Automatic Speech Recognition  and  Audio Indexing

Results NIST02

Page 51: Automatic Speech Recognition  and  Audio Indexing

Results NIST00

Page 52: Automatic Speech Recognition  and  Audio Indexing

Results NIST02

Page 53: Automatic Speech Recognition  and  Audio Indexing

References

[1] L. Mary, B. Yegnanarayana, Extraction and Representation of Prosodic Features for Language and Speaker Recognition, Speech Communication, pp. 782-796, Vol. 50, 2008.

[2] S.-X. Zhang, M.-W. Mak, A New Adaptation Approach to High-Level Speaker-Model Creation in Speaker Verification, Speech Communication, pp. 534-550, Vol. 51, 2009.

Page 54: Automatic Speech Recognition  and  Audio Indexing

Detecting Speech and Music Based on Spectral Tracking [1]

Page 55: Automatic Speech Recognition  and  Audio Indexing

References

[1] T. Taniguchi, M. Tohyama, K. Shirai, Detection of Speech and Music Based on Spectral Tracking, Speech Communication, pp. 547-563, Vol. 50, 2008.