Automatic Speech Recognition (ASR)
● Technology for converting speech to text
● Speech recognition doesn’t mean that a computer “understands you”
● Applications
○ Dictating text
○ Transcribing large multimedia collections
■ In order to make them indexable and more accessible
○ Human-computer interaction
■ Android Assistant, Siri, Amazon Echo, etc.
■ Needs many other technologies besides speech recognition
Why is speech recognition difficult?
● Developing a speech recognition system is complicated, as it requires expert knowledge from various fields
● Speech recognition is difficult for computers because
○ Human speech has a lot of variety
○ Human speech contains many “mistakes” that a human brain can auto-correct, using world knowledge
○ Human speech is not (only) about recognizing which phonemes were spoken
■ Famous example: “it’s easy to recognize speech” vs “it’s easy to wreck a nice beach”
■ Estonian example: /kas:a/ -> “kassa”, “kas sa” or maybe “Gassa”?
○ Computationally complex: for each input utterance, the computer has to find the most probable sequence of words out of all possible word combinations
Sources of variability
● Acoustic conditions
○ Room acoustics, background noise, other speakers in the background, distance from mic
● Microphone
○ Type, directional characteristics, electrical noise, ADC quality, telephone
● Speaker
○ Anatomy (size of the speech organs, gender, age)
○ Health, pronunciation
○ Social background
○ Dialect, linguistic background
● Speaking style
○ Dictation
○ Spontaneous speech between humans
● Phonetic context: the same phoneme sounds different in different phonetic contexts
Speech signal
● Sound is a waveform of changing air pressure
● Speech is a sound produced by the speech organs
● A microphone converts air pressure modulation to voltage modulation
● An analog-to-digital converter converts the continuous voltage signal to a digital signal, by sampling the value of the signal at periodic intervals
● Sampling frequency (samples per second):
○ Telephone: 8000 Hz
○ CD: 44100 Hz
○ For speech, 16000 Hz is usually sufficient
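As a minimal illustration of sampling (using NumPy; the signal below is synthetic, not real speech):

```python
import numpy as np

# Sampling turns a continuous signal into discrete values taken at
# periodic intervals: one second of a 440 Hz sine at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate     # time stamp of each sample
signal = np.sin(2 * np.pi * 440 * t)

print(signal.shape)  # (16000,): 16000 samples per second of audio
```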
Feature extraction
● Raw audio signal contains a lot of information (typically 16000 values every second)
● The goals of feature extraction:
○ Reduce the amount of information
○ Extract information that is important for distinguishing between speech sounds
○ Be robust against noise, channel distortions, and speaker variation
Spectrogram
● If we turn the spectrum 90 degrees up,
● represent the values using grayscale,
● and concatenate the spectrums of all frames,
● then we will get a spectrogram
● There are people (phoneticians) who can “read” a spectrogram, i.e., roughly understand what was said
Filterbanks
● Human hearing is less sensitive to higher frequencies, so human perception of frequency is nonlinear
● Also, the spectrogram still contains too much information
● That’s why mel-scale filterbanks are used to discretize the spectrogram
● Mel scale: use narrower bands in lower frequencies and wider bands in higher frequencies
● Each filter collects energy from a number of frequency bands in the spectrogram
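A mel filterbank can be sketched in NumPy as below; the filter count, FFT size, and the standard mel formula are illustrative defaults, not the exact parameters of any particular toolkit:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: compresses higher frequencies
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    """Bank of triangular filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2)
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Map band edges to FFT bin indices
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (40, 257): 40 filters over 257 FFT bins
```

Multiplying this matrix with a frame’s power spectrum yields the 40 filterbank energies used as features.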
Features
● Result: a speech signal (e.g., for an utterance) is transformed into a sequence of fixed-dimensional (typically 40-dimensional) feature vectors
● The length of the sequence varies (based on the length of the input utterance)
● This representation is similar to document representation using word embeddings!
● Many familiar models from NLP (convolutional neural nets, LSTMs) can also be applied to speech!
Speech recognition: metrics
● The most common quality metric for speech recognition is word error rate (WER)
● Distinguishes between 3 types of errors:
○ Insertion (I)
○ Deletion (D)
○ Substitution (S)
● To find the errors, reference and hypothesis texts are aligned, using a technique called dynamic programming
● WER = (S + D + I) / N, where N is the number of words in the reference
Word error rate, example
● Example (** is just a placeholder):

Scores: (#C #S #D #I) 11 1 1 0
REF: aga ma olen aru saanud ET sel nädalal lugedes siin SISEMINISTRI intervjuusid ja
HYP: aga ma olen aru saanud ** sel nädalal lugedes siin SISEMIST intervjuusid ja

● S=1, D=1, I=0, N=13
● WER = (1+1+0)/13 = 15.4%
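The alignment and WER computation can be sketched with a standard Levenshtein distance via dynamic programming (this simple version returns only the rate, not the separate S/D/I counts):

```python
def wer(ref, hyp):
    """Word error rate between a reference and a hypothesis string."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum number of edits between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution and one deletion against a 13-word reference,
# mirroring the structure of the example above:
print(round(wer("a b c d e f g h i j k l m",
                "a b c d e g h i j X l m"), 3))  # 0.154
```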
End-to-end speech recognition
● Model: Listen, Attend and Spell (LAS)
● Very similar to neural machine translation
● LAS transcribes speech utterances directly into characters
● Consists of two submodules: the listener and the speller
○ The listener is an acoustic model encoder
○ The speller is an attention-based character decoder
LAS: Listener
● Input to the Listener is a sequence of filterbank features
● The Listen model uses many layers of bidirectional LSTMs with a pyramidal structure
○ So that the model converges faster
● In each layer, two consecutive outputs of the lower layer are concatenated and processed by the BLSTM
○ The pyramidal structure also reduces computational cost
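The pyramidal time reduction can be sketched with plain arrays standing in for BLSTM outputs (shapes are illustrative):

```python
import numpy as np

def pyramid_step(x):
    """Concatenate pairs of consecutive frames, halving sequence length.
    This is the reduction applied between pyramidal BLSTM layers."""
    t, d = x.shape
    t = t - (t % 2)                      # drop a trailing odd frame
    return x[:t].reshape(t // 2, 2 * d)

seq = np.random.randn(100, 40)           # 100 frames of 40-dim features
out = pyramid_step(pyramid_step(pyramid_step(seq)))
print(out.shape)  # (12, 320): 8x fewer timesteps, 8x wider vectors
```

In the real model a BLSTM layer processes the concatenated frames at each level; here only the length reduction is shown.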
LAS: Attend and Spell
● The distribution over output characters y_i is a function of the decoder state s_i and the context c_i
● The decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1}, and the previous context c_{i-1}
● The context vector c_i is produced by an attention mechanism from the listener’s states h_1..h_v
Attention visualization
● On the right: alignments between character outputs and the audio signal produced by the LAS model for the utterance “how much would a woodchuck chuck”
● The content-based attention mechanism was able to correctly identify the start position in the audio sequence for the first character
● The alignment produced is generally monotonic
Problem with end-to-end speech recognition
● End-to-end speech recognition works well if there are thousands of hours of training data available
● Training data is speech with manual transcriptions -- expensive to produce
○ E.g., for Estonian, we use around 300 h of training data
● In order to improve speech recognition, we have to use inductive bias
○ That is, use our knowledge/assumptions about how language works, in order to constrain the freedom of the model
The Noisy Channel Model
● Search through the space of all possible sentences
● Pick the one that is most probable given the waveform
Noisy Channel, cont.
● What is the most likely sentence out of all sentences in the language L given some acoustic input (feature vectors) O?
● Treat acoustic input O as a sequence of individual observations O = o_1, o_2, o_3, …, o_t
● Define a sentence as a sequence of words: W = w_1, w_2, w_3, …, w_n
● Probabilistically: pick W with the highest probability:
  Ŵ = argmax_{W ∈ L} P(W | O)
● We can use the Bayes rule to rewrite this:
  Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)
● Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax (since we only care about finding the sentence with maximum probability, not the actual probability):
  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
Markov models
● Markov models resemble weighted finite state automata
● A Markov model has a set of states, transition probabilities between the states (and self-loops), and special start/end states

Hidden Markov models
● Hidden Markov models add two things to the simple Markov model:
○ Observation symbols
○ Observation likelihoods for each state
Jason Eisner task
● Toy task for understanding HMMs
● You are a climatologist in 2799 studying the history of global warming
● You can’t find records of the weather in Baltimore for summer 2006
● But you do find Jason Eisner’s diary, which records how many ice creams he ate each day
● Can we use this to figure out the weather?
○ Given a sequence of observations O,
■ each observation an integer = number of ice creams eaten
■ figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream
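A minimal Viterbi decoder for this toy HMM can look as follows; the transition and emission probabilities are illustrative, not Eisner's exact numbers:

```python
# States: H(ot) and C(old) days; observations: 1-3 ice creams eaten.
states = ["H", "C"]
start = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Most probable hidden state sequence for an observation sequence."""
    # v[s] = probability of the best path ending in state s
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []                            # backpointers per timestep
    for o in obs[1:]:
        prev_v, v, bp = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev_v[p] * trans[p][s])
            bp[s] = best_prev
            v[s] = prev_v[best_prev] * trans[best_prev][s] * emit[s][o]
        back.append(bp)
    # Trace the best path backwards through the backpointers
    last = max(states, key=lambda s: v[s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi([3, 1, 3]))  # ['H', 'H', 'H']
```

With these numbers, lots of ice cream pulls the decoder towards hot days even across a single low-ice-cream day.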
HMMs for speech recognition
● Phones are not homogeneous
● Therefore, each phone is modeled by a three-state (or two-state) HMM
● The states correspond to subphones (the start state, middle state, end state)
● Transition probabilities: how likely it is that the subphone takes a self-loop or goes to the next subphone (sort of a duration model)
● Observation likelihoods: how likely each subphone is to generate a certain feature vector
● That is, instead of discrete elements (number of ice creams), the outputs are 40-dimensional vectors
Acoustic modeling with HMMs
● HMMs are used to model different phonemes
● A pronunciation lexicon maps each word in the language model to a sequence of basic phonemes, e.g.:

  ONE   W AH N
  TWO   T UW
  THREE TH R IY
  ...

● Word HMMs are synthesized from phoneme HMMs based on the pronunciation lexicon
● Decoding: find the path that most likely “produced” the feature vector sequence
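Synthesizing a word HMM from phone HMMs can be sketched as a state-sequence expansion (lexicon entries from the slide; the subphone naming scheme is illustrative):

```python
# Pronunciation lexicon: word -> phone sequence
lexicon = {"ONE": ["W", "AH", "N"],
           "TWO": ["T", "UW"],
           "THREE": ["TH", "R", "IY"]}

def word_states(word):
    """Expand a word into its HMM state sequence: each phone
    contributes begin/middle/end subphone states."""
    return [f"{ph}_{part}" for ph in lexicon[word]
            for part in ("b", "m", "e")]

print(word_states("TWO"))
# ['T_b', 'T_m', 'T_e', 'UW_b', 'UW_m', 'UW_e']
```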
Context-sensitive phones
● Phonemes sound very different, depending on the context
○ Especially at the start and end
○ E.g. /ae/ in the word “sat” is different from /ae/ in the word “man”
● Therefore we model words using triphones (or biphones)
○ A triphone is a “phone in a certain context”
○ “sat” is synthesized from the triphones /sil-s+ae/, /s-ae+t/, /ae-t+sil/
● Each triphone is modeled using three states
● In order to reduce the number of different states, states of the same phoneme in similar contexts are tied together (modeled by the same parameters)
● This gives more training data for each tied state
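The triphone expansion follows the common left-phone - phone + right-phone notation, with "sil" padding at utterance boundaries (a sketch matching the "sat" example):

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones,
    written as left-phone - phone + right-phone."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["s", "ae", "t"]))
# ['sil-s+ae', 's-ae+t', 'ae-t+sil']
```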
Context-sensitive phones, cont.
● Triphone state clustering is usually done using a decision tree
● The decision tree is learned from data
● The resulting tied states are called senones
● Typically, we want to have around 5000 senones in our model (more if we have a lot of training data)
Connecting speech features and HMMs
● In hybrid DNN-HMM speech recognition, the output distribution of HMM states is parametrized by a DNN
● The DNN uses a stack of neighbouring acoustic feature vectors as input
● And calculates the probability over the HMM states as output
DNN in the acoustic model
● The DNN in the acoustic model has a similar role as a DNN in image recognition
● Given (typically 50) neighbouring acoustic feature vectors (an “image”), decide what phoneme state (e.g., beginning of /a/) is in the centre
● The DNN produces a distribution over all phoneme states, given a stack of feature vectors, i.e., P(state | o_{t-k}, …, o_{t+k})
Flipping probabilities
● Since the HMM is a generative model, we need to know the probability that a certain phone state generates a certain feature vector
● However, the DNN is discriminative, meaning it gives us the probability that a certain feature vector corresponds to a certain state
● The Bayes rule helps to transform the latter into the former:
  p(o | state) ∝ P(state | o) / P(state)
● That is, we need to divide the outputs of the DNN by the prior probability of the states (it’s just their frequency in the training data)
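The division by state priors can be sketched on toy numbers (three hypothetical states; the values are invented for illustration):

```python
import numpy as np

# DNN posteriors P(state | features) for one frame, over three toy states
posteriors = np.array([0.7, 0.2, 0.1])
# State priors P(state): the frequency of each state in the training data
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods p(features | state), up to a constant factor
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods)  # [1.4        0.66666667 0.5       ]
```

Note how the most frequent state gets its posterior discounted the least relative to its probability mass; rare states are boosted.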
DNN architectures for speech recognition
● Much of the research in speech recognition is just looking for DNN architectures that work best for this task
● Currently, some of the best working DNN architectures for acoustic modeling are
○ Time-delay neural network (TDNN) = many layers of dilated 1D convolutions
○ The above combined with a (bidirectional) LSTM
■ Update: the LSTM is nowadays often left out
○ Typically around 5-10 hidden layers
Training DNN models for speech recognition
● DNNs used in the acoustic model are similar to the ones used in image recognition
● Training image recognition models is straightforward, as we know the true label of each image in the training data
● For speech, we only know the true transcript of each speech utterance, but we don’t know the true frame-by-frame phoneme sequence
● Therefore, we usually first train an old-school Gaussian Mixture Model based HMM that can produce the alignment between feature vectors and phonemes, given the transcript
● The DNN can now be trained similarly to image recognition
○ Input: a window of acoustic feature vectors (the “image”)
○ Output: the phoneme corresponding to the center of the window (the label)
● Nowadays this objective is often mixed with a more complicated sequence-level criterion
Language model
● The acoustic model gives us the probabilities that the feature vectors were “generated” by a certain sentence
● The language model gives us prior probabilities of different sentences
● Prior probabilities: probabilities of “hearing” a sentence, before the user even starts speaking
Rule-based LM
● The simplest language model is a rule-based model
● A rule-based LM typically assigns equal probabilities to all “valid” sentences, and zero probability to all invalid sentences

E.g., a rule-based LM for controlling a robot:
● (go|move) ((one meter)|((two|three) meters)) (forward|backward)
● turn (right|left)

This allows sentences like “move one meter forward”, but not “move right” or “two three”
Statistical language models
● For natural (unconstrained) speech recognition, N-gram language models are used
● N-gram LMs make the assumption that word probability depends only on the N-1 previous words
● Usually in speech recognition, N=3 or N=4. In case of trigrams (N=3):
  P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
● N-gram probabilities are estimated from a big text corpus, and smoothed
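An unsmoothed trigram estimate from a toy corpus looks like this (real systems use large corpora plus smoothing such as Kneser-Ney):

```python
from collections import Counter

# Toy corpus; real LMs are estimated from billions of words
corpus = "the cat sat on the mat the cat ate".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))  # trigram counts
bi = Counter(zip(corpus, corpus[1:]))               # bigram counts

def p(w, u, v):
    """Maximum-likelihood estimate P(w | u, v) = count(u,v,w) / count(u,v)."""
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

print(p("sat", "the", "cat"))  # 0.5: "the cat" is followed by "sat" once, "ate" once
```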
Decoding
● Decoding goal: find the word sequence W that maximizes:
  Ŵ = argmax_W P(O | W) P(W)
● All models are integrated during decoding
○ DNN acoustic model
○ HMM transition probabilities
○ Pronunciation lexicon
○ Word N-gram probabilities
● Heavy pruning is needed to deal with large word vocabularies
○ Pruning drops “hopeless” branches early, in order to save computation
● Algorithmically quite complicated, if efficiency is needed
Recurrent neural network language model
● Nowadays, a recurrent neural network (RNN) language model is often used for improving speech recognition performance
● An RNN is too computationally expensive to be used in the first pass of decoding
● Therefore, an N-gram model is used during decoding
● However, the decoding pass generates a recognition lattice (a compact representation of the most likely recognition hypotheses)
● The RNN is then used for rescoring, i.e., to assign new LM scores to different paths through the lattice
● This reduces errors by 10-20% (relative)
Porting to new domains
● Acoustic model (AM) and language model (LM) can be trained on different data
● This allows easy porting of speech recognition systems to new domains
● For example, the AM is usually trained on transcribed broadcast data, lecture transcripts, and dictated speech
● The LM, however, can be trained on domain-specific texts (no speech is necessary)
● Example: we created a dictation system for the medical domain, using an AM trained on (mostly) broadcast speech, and an LM trained on large amounts of medical texts
Estonian speech recognition
● At Tallinn University of Technology, we have been working on usable and practical Estonian ASR for about 15 years
● Accuracy depends a lot on the domain
○ Spontaneous speech is the hardest, as the language model is mostly trained on written language
○ Noisy environments also introduce errors
● Our system is free to use, open source, and available as a Docker container
● Can be used for producing video subtitles, e.g. https://youtu.be/9YQHNh3zeDs
Domain                                                    WER
Radio talkshows                                           9.1%
Speeches from a linguistics conference                    13.7%
Spontaneous speech (studio-recorded)                      17.6%
Recordings “from the wild” (real meeting and interview
recordings with random devices in noisy environments)     25.0%