A brief overview of Speech Recognition and Spoken Language Processing
Advanced NLP Guest Lecture, August 31
Andrew Rosenberg




Welcome
Introduction and Overview

A brief overview of Speech Recognition and Spoken Language Processing
Advanced NLP
Guest Lecture, August 31
Andrew Rosenberg

Speech and NLP
Communication in Natural Language

Text:
- Carefully prepared
- Grammatical
- Machine readable
- Typos; sometimes OCR or handwriting issues

Speech and NLP
Communication in Natural Language

Speech:
- Spontaneous
- Less grammatical
- Machine readable, with > 10% error when using speech recognition

The traditional view

Text Processing System:

Named Entity Recognizer trained on Text Documents, applied to Text Documents.

The simplest approach

Text Processing System:

Named Entity Recognizer trained on Text Documents, applied to Transcribed Documents.

What's the problem with this?

Speech is errorful text

Text Processing System:

Named Entity Recognizer trained on Transcribed Documents, applied to Transcribed Documents.

One better. What's the potential problem here? Use speech for training and testing.

Speech signal can be used

Text Processing System:

Named Entity Recognizer trained on Transcribed Documents, applied to Transcribed Documents, now with access to the speech signal.

One better: use transcribed speech AND signal features. What's the potential problem here? Use speech for training and testing.

Hybrid speech signal and text

Text Processing System:

Named Entity Recognizer trained on Transcribed Documents, applied to Transcribed Documents, with the speech signal.

Text Documents are also used for training. One better: use transcribed speech AND signal features. What's the potential problem here? Use speech for training and testing.

Speech Recognition
Standard HMM speech recognition.

Front End → Acoustic Model → Pronunciation Model → Language Model → Decoding

Speech Recognition

Front End → (acoustic feature vectors) → Acoustic Model → (phone likelihoods) → Pronunciation Model → (word likelihoods) → Language Model → Word Sequence

Speech Recognition
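These components fit together in the standard noisy-channel factorization of HMM speech recognition. The slides imply it without writing it out, so the following is a sketch in conventional notation, where X is the sequence of acoustic feature vectors, Q a phone sequence, and W a word sequence:

```latex
W^{*} = \arg\max_{W} P(W \mid X)
      = \arg\max_{W} \, P(X \mid W)\,\underbrace{P(W)}_{\text{language model}},
\qquad
P(X \mid W) \approx \max_{Q}\,
  \underbrace{P(X \mid Q)}_{\text{acoustic model}}\,
  \underbrace{P(Q \mid W)}_{\text{pronunciation model}}
```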

- Front End: convert sounds into a sequence of observation vectors.
- Language Model: calculate the probability of a sequence of words.
- Pronunciation Model: the probability of a pronunciation given a word.
- Acoustic Model: the probability of a set of observations given a phone label.

Front End
How do we convert a waveform into a useful representation? We are looking for a vector of numbers that describes the acoustic content. Assuming 22 kHz, 16-bit sound, modeling the raw samples directly is not feasible... yet.

Discrete Cosine Transform
Every wave can be decomposed into component sine or cosine waves.

The Fast Fourier Transform is used to do this efficiently.

Overlapping frames
Spectrograms allow for visual inspection of spectral information. We are looking for a compact, numerical representation.

[Diagram: overlapping 10 ms analysis frames]

Single Frame of FFT
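A minimal sketch of these two steps, slicing a waveform into overlapping frames and taking the magnitude spectrum of one frame. The 25 ms window and 10 ms hop are common defaults, not values given in the lecture:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Slice a 1-D waveform into overlapping frames (25 ms window, 10 ms hop)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def frame_spectrum(frame):
    """Magnitude spectrum of a single windowed frame."""
    windowed = frame * np.hamming(len(frame))  # taper edges to reduce leakage
    return np.abs(np.fft.rfft(windowed))       # one-sided FFT magnitudes

# Toy usage: one second of a 440 Hz sine at 22 kHz.
sr = 22050
t = np.arange(sr) / sr
frames = frame_signal(np.sin(2 * np.pi * 440 * t), sr)
spectrum = frame_spectrum(frames[0])
print(frames.shape, spectrum.shape)
```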

[Figure: FFT of Australian male /i:/ from "heed", 12.8 ms analysis window; source: http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html]

Example Spectrogram

[Figure: example spectrogram from Praat]

Standard Representation
Mel Frequency Cepstral Coefficients (MFCC)

MFCC pipeline: Pre-Emphasis → window → FFT → Mel-Filter Bank → log → inverse FFT → Deltas
Each frame yields 12 MFCCs plus 1 energy term; adding their deltas and double-deltas gives the standard 39-dimensional feature vector.

Speech Recognition
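A minimal sketch of computing these features with librosa (the toolkit is my assumption; the lecture names none). C0 stands in for the energy term, and "utterance.wav" is a hypothetical file:

```python
import numpy as np
import librosa

# Any mono speech recording works; the filename here is a placeholder.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 cepstral coefficients: C0 acts as the energy-like term,
# C1-C12 are the 12 MFCCs from the slide's pipeline.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Delta and delta-delta features, as in the "Deltas" stage.
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, d1, d2])  # 39 x n_frames observation matrix
print(features.shape)
```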

- Front End: convert sounds into a sequence of observation vectors.
- Language Model: calculate the probability of a sequence of words.
- Pronunciation Model: the probability of a pronunciation given a word.
- Acoustic Model: the probability of a set of observations given a phone label.

Language Model
What is the probability of a sequence of words?

Assume you have a vocabulary of V words. How many possible sequences of N words are there? V^N, far too many to model directly, which is why language models factor the sequence probability into smaller conditional probabilities.

General Language Modeling
Any probability calculation can be used here: class-based language models, e.g. recurrent neural networks.
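For concreteness, a minimal count-based bigram model with add-one smoothing; the toy corpus and the smoothing choice are illustrative assumptions, not from the lecture:

```python
from collections import Counter

# Toy corpus; in practice this would be a large text collection.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
V = len(unigrams)

def p_bigram(prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def p_sequence(words):
    """Probability of a word sequence under the bigram factorization."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_sequence(["the", "cat", "sat"]))
```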

Speech Recognition

- Front End: convert sounds into a sequence of observation vectors.
- Language Model: calculate the probability of a sequence of words.
- Pronunciation Model: the probability of a pronunciation given a word.
- Acoustic Model: the probability of a set of observations given a phone label.

Pronunciation Modeling
Identify the likelihood of a phone sequence given a word sequence. There are many simplifying assumptions in pronunciation modeling, e.g. that the pronunciation of each word is independent of the previous and following words.

Dictionary as Pronunciation Model
Assume each word has a single pronunciation:

I        AY
CAT      K AE T
THE      DH AH
HAD      HH AE D
ABSURD   AH B S ER D
YOU      Y UH

Weighted Dictionary as Pronunciation Model
Allow multiple pronunciations and weight each by its likelihood:

I     AY       .4
I     IH       .6
THE   DH AH    .7
THE   DH IY    .3
YOU   Y UH     .5
YOU   Y UW     .5
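A minimal sketch of how a weighted dictionary can back the pronunciation model, using the slide's entries; the data layout and lookup function are my assumptions:

```python
# Weighted pronunciation dictionary: word -> list of (phone sequence, probability).
PRONUNCIATIONS = {
    "I":   [(("AY",), 0.4), (("IH",), 0.6)],
    "THE": [(("DH", "AH"), 0.7), (("DH", "IY"), 0.3)],
    "YOU": [(("Y", "UH"), 0.5), (("Y", "UW"), 0.5)],
}

def pronunciation_prob(word, phones):
    """P(phone sequence | word); 0.0 if the pronunciation is not listed."""
    for pron, prob in PRONUNCIATIONS.get(word, []):
        if pron == tuple(phones):
            return prob
    return 0.0

# Words are assumed independent, so a sentence's pronunciation probability
# is the product of per-word probabilities.
print(pronunciation_prob("THE", ["DH", "IY"]))  # 0.3
```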

Grapheme to Phoneme conversion
What about words that you have never seen before? What if you don't think you've seen every possible pronunciation?

How do you pronounce: McKayla? or Zoomba?

Try to learn the phonetics of the language.

Letter to Sound Rules
Manually written rules that convert one or more letters to one or more sounds.

T -> /t/
H -> /h/
TH -> /dh/
E -> /e/

These rules can get complicated based on the surrounding context: K is silent when word-initial and followed by N.
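A toy sketch of ordered letter-to-sound rules, covering only the slide's examples plus the silent-K rule; real systems use far larger, carefully ordered rule sets or learned models:

```python
# Ordered letter-to-sound rules: (letters, phones, word_initial_only).
RULES = [
    ("KN", ["n"], True),    # K is silent word-initially before N
    ("TH", ["dh"], False),  # digraphs must be checked before single letters
    ("T",  ["t"], False),
    ("H",  ["h"], False),
    ("E",  ["e"], False),
    ("N",  ["n"], False),
]

def letters_to_sounds(word):
    """Greedy left-to-right application of the first matching rule."""
    word, phones, pos = word.upper(), [], 0
    while pos < len(word):
        for letters, output, initial_only in RULES:
            if initial_only and pos != 0:
                continue
            if word.startswith(letters, pos):
                phones.extend(output)
                pos += len(letters)
                break
        else:
            pos += 1  # no rule for this letter: skip it in this toy sketch
    return phones

print(letters_to_sounds("KNEE"))  # ['n', 'e', 'e']
print(letters_to_sounds("THE"))   # ['dh', 'e']
```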

Speech Recognition

- Front End: convert sounds into a sequence of observation vectors.
- Language Model: calculate the probability of a sequence of words.
- Pronunciation Model: the probability of a pronunciation given a word.
- Acoustic Model: the probability of a set of observations given a phone label.

Acoustic Modeling
Hidden Markov model: used to model the relationship between two sequences.

Hidden Markov Model
In a Hidden Markov Model the state sequence is unobserved; only an observation sequence is available.

[Diagram: states q1, q2, q3 emitting observations x1, x2, x3]

Observations are MFCC vectors; states are phone labels. Each state (phone) has an associated GMM modeling the MFCC likelihood.

Training acoustic models
- TIMIT: close, manual phonetic transcription; 2342 sentences.
- Extract MFCC vectors from each frame within each phone.
- For each phone, train a GMM using Expectation Maximization.
- These GMMs are the Acoustic Model. It is common to use 8 or 16 Gaussian mixture components.

Gaussian Mixture Model

[Figure: Gaussian Mixture Model illustration]
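A minimal sketch of the per-phone GMM training step using scikit-learn's EM implementation; the random feature arrays below are placeholders for real TIMIT-aligned 39-dimensional MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: in practice, MFCC(+delta) frames grouped by the
# phone label of each frame's time-aligned segment.
rng = np.random.default_rng(0)
frames_by_phone = {
    "ae": rng.normal(0.0, 1.0, size=(500, 39)),
    "iy": rng.normal(1.0, 1.0, size=(500, 39)),
}

# One GMM per phone, fit with EM; 8 components, as on the slide.
acoustic_model = {}
for phone, frames in frames_by_phone.items():
    gmm = GaussianMixture(n_components=8, covariance_type="diag")
    acoustic_model[phone] = gmm.fit(frames)

def log_likelihood(phone, frame):
    """log P(observation | phone): the acoustic model score."""
    return acoustic_model[phone].score_samples(frame[None, :])[0]

print(log_likelihood("ae", rng.normal(size=39)))
```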

HMM Topology for Training
Rather than having one GMM per phone, it is common for acoustic models to represent each phone as three sub-phone states (context-dependent triphones).

[Diagram: HMM topology with states S1 through S5]
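These phone HMMs are what decoding searches over: finding the most likely hidden state sequence for the observations. A minimal log-space Viterbi sketch over toy states; the transition scores and emission scorer are invented stand-ins for trained models:

```python
STATES = ["sil", "r", "iy"]                     # toy phone states
LOG_TRANS = {                                    # log P(next | current); invented
    "sil": {"sil": -0.5, "r": -1.0, "iy": -3.0},
    "r":   {"sil": -3.0, "r": -0.7, "iy": -0.9},
    "iy":  {"sil": -1.0, "r": -3.0, "iy": -0.5},
}

def viterbi(observations, log_emit):
    """Most likely state path; log_emit(state, obs) would come from the GMMs."""
    trellis = [{s: (log_emit(s, observations[0]), [s]) for s in STATES}]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            prev, (score, path) = max(
                ((p, trellis[-1][p]) for p in STATES),
                key=lambda kv: kv[1][0] + LOG_TRANS[kv[0]][s])
            column[s] = (score + LOG_TRANS[prev][s] + log_emit(s, obs),
                         path + [s])
        trellis.append(column)
    return max(trellis[-1].values(), key=lambda v: v[0])[1]

# Toy emission scorer standing in for the per-phone GMM log-likelihoods.
def toy_emit(state, obs):
    return -abs({"sil": 0.0, "r": 1.0, "iy": 2.0}[state] - obs)

print(viterbi([0.1, 0.9, 1.1, 2.0], toy_emit))
```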

[Example: HMM topology for /r/]

Speech in Natural Language Processing

ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHATS THE STATION NAME DOWNTOWN CROSSING UM AND THATLL GET YOU BACK TO THE RED LINE JUST AS EASILY

This is how much of spoken language processing and NLP treat speech.

There are transcription errors. There is no punctuation. There is no segmentation.

There are grammatical issues. Word choice is different, structure is different.

Speech in Natural Language Processing

Also, from the North Station...

(I think the Orange Line runs by there too so you can also catch the Orange Line... )

And then instead of transferring

(um I- you know, the map is really obvious about this but)

Instead of transferring at Park Street, you can transfer at (uh what's the station name) Downtown Crossing and (um) that'll get you back to the Red Line just as easily.


Spoken Language Processing

Speech Recognition → NLP system (IR, IE, QA, Summarization, Topic Modeling).

This looks fundamentally appealing.

Step 1. Turn speech into text. Step 2. Process the text.

Spoken Language Processing

[Same raw ASR transcript as above] → NLP system (IR, IE, QA, Summarization, Topic Modeling).

This text is incomplete. It doesn't contain all of the components of text that NLP systems expect: grammaticality, disfluencies, segmentation.

Dealing with Speech Errors

[Same raw ASR transcript as above] → Robust NLP system (IR, IE, QA, Summarization, Topic Modeling).

Robust NLP systems:
1. Aggressive smoothing
2. Partial parsing
3. Weighting by confidence scores (see the sketch below)
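A minimal sketch of option 3, weighting by confidence scores: scale each ASR token's contribution to a bag-of-words representation by its recognition confidence, so likely misrecognitions contribute less downstream. The token/confidence pairs are invented:

```python
from collections import defaultdict

# ASR output as (token, confidence) pairs; the values here are invented.
asr_output = [("ALSO", 0.95), ("FROM", 0.90), ("NORTH", 0.85),
              ("STATION", 0.88), ("UM", 0.40), ("ORANGE", 0.92),
              ("LINE", 0.91)]

def confidence_weighted_bow(tokens):
    """Bag of words where each count is scaled by ASR confidence."""
    bow = defaultdict(float)
    for token, conf in tokens:
        bow[token.lower()] += conf
    return dict(bow)

print(confidence_weighted_bow(asr_output))
```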

Automatic Speech Recognition Assumption

[Same raw ASR transcript as above]

ASR produces a transcript of speech; that transcript is just text. All the information that you need is in the transcript.

Automatic Speech Recognition Assumption: Rich Transcription

Also, from the North Station...

[Same rich transcription example as above.]

ASR produces a transcript of speech. Even Rich Transcription is missing information, but it's getting closer.

And it requires prosodic analysis.

Speech as Noisy Text
Speech Recognition → Robust NLP system (IR, IE, QA, Summarization, Topic Modeling). Two directions: decrease WER, increase robustness.

The common approach to improving performance on speech data.

Non-grammaticality, disfluencies, neologisms, out-of-domain errors: inconsistency in productions between what the systems were trained on and the speech input.

Other directions for improvement
Speech Recognition → Prosodic Analysis → Robust NLP system (IR, IE, QA, Summarization, Topic Modeling). Use lattices or N-best lists (a sketch follows below). Prosody captures a lot of what's missing.
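A minimal sketch of consuming an N-best list rather than the one-best transcript: rescore each hypothesis by combining its acoustic score with a language-model score. The hypotheses, scores, weight, and the stub LM are illustrative assumptions:

```python
# N-best list from a recognizer: (hypothesis, acoustic log-probability).
# All values are invented for illustration.
nbest = [
    ("you can transfer at downtown crossing", -42.0),
    ("you can transfer a downtown crossing", -41.5),
    ("ewe can transfer at downtown crossing", -41.8),
]

IMPLAUSIBLE = {("transfer", "a"), ("ewe", "can")}

def lm_logprob(hypothesis):
    """Stub LM: flat per-word cost plus penalties for implausible bigrams.
    A real system would use an n-gram or neural language model."""
    words = hypothesis.split()
    score = -2.0 * len(words)
    score -= sum(5.0 for bg in zip(words, words[1:]) if bg in IMPLAUSIBLE)
    return score

def rescore(nbest, lm_weight=1.0):
    """Pick the hypothesis maximizing acoustic + weighted LM score."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))

# The LM overturns the acoustically best but implausible hypothesis.
print(rescore(nbest))
```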

Also speaker ID, paralinguistics, etc.

Processing Speech

Processing speech is difficult:
- There are errors in transcripts.
- It is not grammatical.
- The style (genre) of speech is different from the available (text) training data.

Processing speech is easy:
- Speaker information
- Intention (sarcasm, certainty, emotion, etc.)
- Segmentation
