Automatic Speech Recognition (ASR)
● Technology for converting speech to text
● Speech recognition doesn’t mean that a computer “understands you”
● Applications
○ Dictating text
○ Transcribing large multimedia collections
■ In order to make them indexable and more accessible
○ Human-computer interaction
■ Android Assistant, Siri, Amazon Echo, etc.
■ Needs many other technologies besides speech recognition
Why is speech recognition difficult?
● Developing a speech recognition system is complicated, as it requires expert knowledge from various fields
● Speech recognition is difficult for computers because
○ Human speech has a lot of variety
○ Human speech contains many “mistakes” that a human brain can auto-correct, using world knowledge
○ Human speech is not (only) about recognizing which phonemes were spoken
■ Famous example: “it’s easy to recognize speech” vs “it’s easy to wreck a nice beach”
■ Estonian example: /kas:a/ -> “kassa”, “kas sa” or maybe “Gassa”?
○ Computationally complex: for each input utterance, the computer has to find the most probable sequence of words out of all possible word combinations
Sources of variability
● Acoustic conditions
○ Room acoustics, background noise, other speakers in the background, distance from mic
● Microphone
○ Type, directional characteristics, electrical noise, ADC quality, telephone
● Speaker
○ Anatomy (size of the speech organs, gender, age)
○ Health, pronunciation
○ Social background
○ Dialect, linguistic background
● Speaking style
○ Dictation
○ Spontaneous speech between humans
● Phonetic context: the same phoneme sounds different in different phonetic contexts
Speech signal
● Sound is a waveform of changing air pressure
● Speech is a sound produced by the speech organs
● A microphone converts air pressure modulation to voltage modulation
● An analog-to-digital converter converts the continuous voltage signal to a digital signal, by sampling the value of the signal at periodic intervals
● Sampling frequency (samples per second):
○ Telephone: 8000 Hz
○ CD: 44100 Hz
○ For speech, 16000 Hz is usually sufficient
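As a minimal illustration of sampling (using NumPy; the signal below is synthetic, not real speech):

```python
import numpy as np

# Sampling turns a continuous signal into discrete values taken at
# periodic intervals: one second of a 440 Hz sine at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate     # time stamp of each sample
signal = np.sin(2 * np.pi * 440 * t)

print(signal.shape)  # (16000,): 16000 samples per second of audio
```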
Feature extraction
● Raw audio signal contains a lot of information (typically 16000 values every second)
● The goals of feature extraction:
○ Reduce the amount of information
○ Extract information that is important for distinguishing between speech sounds
○ Be robust against noise, channel distortions, and speaker variation
Spectrogram
● If we turn the spectrum 90 degrees up,
● represent the values using grayscale,
● and concatenate the spectrums of all frames,
● then we will get a spectrogram
● There are people (phoneticians) who can “read” a spectrogram, i.e., roughly understand what was said
Filterbanks
● Human hearing is less sensitive to higher frequencies, so human perception of frequency is nonlinear
● Also, the spectrogram still contains too much information
● That’s why mel-scale filterbanks are used to discretize the spectrogram
● Mel scale: use narrower bands in lower frequencies and wider bands in higher frequencies
● Each filter collects energy from a number of frequency bands in the spectrogram
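A mel filterbank can be sketched in NumPy as below; the filter count, FFT size, and the standard mel formula are illustrative defaults, not the exact parameters of any particular toolkit:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: compresses higher frequencies
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    """Bank of triangular filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2)
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Map band edges to FFT bin indices
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (40, 257): 40 filters over 257 FFT bins
```

Multiplying this matrix with a frame’s power spectrum yields the 40 filterbank energies used as features.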
Features
● Result: a speech signal (e.g., for an utterance) is transformed into a sequence of fixed-dimensional (typically 40-dimensional) feature vectors
● The length of the sequence varies (based on the length of the input utterance)
● This representation is similar to document representation using word embeddings!
● Many familiar models from NLP (convolutional neural nets, LSTMs) can also be applied to speech!
Speech recognition: metrics
● The most common quality metric for speech recognition is word error rate (WER)
● Distinguishes between 3 types of errors:
○ Insertion (I)
○ Deletion (D)
○ Substitution (S)
● To find the errors, reference and hypothesis texts are aligned, using a technique called dynamic programming
● WER = (S + D + I) / N, where N is the number of words in the reference
Word error rate, example
● Example (** is just a placeholder):

Scores: (#C #S #D #I) 11 1 1 0
REF: aga ma olen aru saanud ET sel nädalal lugedes siin SISEMINISTRI intervjuusid ja
HYP: aga ma olen aru saanud ** sel nädalal lugedes siin SISEMIST intervjuusid ja

● S=1, D=1, I=0, N=13
● WER = (1+1+0)/13 = 15.4%
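The alignment and WER computation can be sketched with a standard Levenshtein distance via dynamic programming (this simple version returns only the rate, not the separate S/D/I counts):

```python
def wer(ref, hyp):
    """Word error rate between a reference and a hypothesis string."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum number of edits between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution and one deletion against a 13-word reference,
# mirroring the structure of the example above:
print(round(wer("a b c d e f g h i j k l m",
                "a b c d e g h i j X l m"), 3))  # 0.154
```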
End-to-end speech recognition
● Model: Listen, Attend and Spell (LAS)
● Very similar to neural machine translation
● LAS transcribes speech utterances directly into characters
● Consists of two submodules: the listener and the speller
○ The listener is an acoustic model encoder
○ The speller is an attention-based character decoder
LAS: Listener
● Input to the Listener is a sequence of filterbank features
● The Listen model uses many layers of bidirectional LSTMs with a pyramidal structure
○ So that the model converges faster
● In each layer, two consecutive outputs of the lower layer are concatenated and processed by the BLSTM
○ The pyramidal structure also reduces computational cost
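The pyramidal time reduction can be sketched with plain arrays standing in for BLSTM outputs (shapes are illustrative):

```python
import numpy as np

def pyramid_step(x):
    """Concatenate pairs of consecutive frames, halving sequence length.
    This is the reduction applied between pyramidal BLSTM layers."""
    t, d = x.shape
    t = t - (t % 2)                      # drop a trailing odd frame
    return x[:t].reshape(t // 2, 2 * d)

seq = np.random.randn(100, 40)           # 100 frames of 40-dim features
out = pyramid_step(pyramid_step(pyramid_step(seq)))
print(out.shape)  # (12, 320): 8x fewer timesteps, 8x wider vectors
```

In the real model a BLSTM layer processes the concatenated frames at each level; here only the length reduction is shown.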
LAS: Attend and Spell
● The distribution over output characters y_i is a function of the decoder state s_i and the context c_i
● The decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1}, and the previous context c_{i-1}
● The context vector c_i is produced by an attention mechanism from the listener’s states h_1..h_v
Attention visualization
● On the right: alignments between character outputs and the audio signal produced by the LAS model for the utterance “how much would a woodchuck chuck”
● The content-based attention mechanism was able to correctly identify the start position in the audio sequence for the first character
● The alignment produced is generally monotonic
Problem with end-to-end speech recognition
● End-to-end speech recognition works well if there are thousands of hours of training data available
● Training data is speech with manual transcriptions -- expensive to produce
○ E.g., for Estonian, we use around 300 h of training data
● In order to improve speech recognition, we have to use inductive bias
○ That is, use our knowledge/assumptions about how language works, in order to constrain the freedom of the model
The Noisy Channel Model
● Search through the space of all possible sentences
● Pick the one that is most probable given the waveform
Noisy Channel, cont.
● What is the most likely sentence out of all sentences in the language L given some acoustic input (feature vectors) O?
● Treat acoustic input O as a sequence of individual observations O = o_1, o_2, o_3, …, o_t
● Define a sentence as a sequence of words: W = w_1, w_2, w_3, …, w_n
● Probabilistically: pick W with the highest probability:
  Ŵ = argmax_{W ∈ L} P(W | O)
● We can use the Bayes rule to rewrite this:
  Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)
● Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax (since we only care about finding the sentence with maximum probability, not the actual probability):
  Ŵ = argmax_{W ∈ L} P(O | W) P(W)
Markov models
● Markov models resemble weighted finite state automata
● A Markov model has a set of states, transition probabilities between the states (and self-loops), and special start/end states

Hidden Markov models
● Hidden Markov models add two things to the simple Markov model:
○ Observation symbols
○ Observation likelihoods for each state
Jason Eisner task
● Toy task for understanding HMMs
● You are a climatologist in 2799 studying the history of global warming
● You can’t find records of the weather in Baltimore for summer 2006
● But you do find Jason Eisner’s diary, which records how many ice creams he ate each day
● Can we use this to figure out the weather?
○ Given a sequence of observations O,
■ each observation an integer = number of ice creams eaten
■ figure out the correct hidden sequence Q of weather states (H or C) which caused Jason to eat the ice cream
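A minimal Viterbi decoder for this toy HMM can look as follows; the transition and emission probabilities are illustrative, not Eisner's exact numbers:

```python
# States: H(ot) and C(old) days; observations: 1-3 ice creams eaten.
states = ["H", "C"]
start = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    """Most probable hidden state sequence for an observation sequence."""
    # v[s] = probability of the best path ending in state s
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []                            # backpointers per timestep
    for o in obs[1:]:
        prev_v, v, bp = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev_v[p] * trans[p][s])
            bp[s] = best_prev
            v[s] = prev_v[best_prev] * trans[best_prev][s] * emit[s][o]
        back.append(bp)
    # Trace the best path backwards through the backpointers
    last = max(states, key=lambda s: v[s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi([3, 1, 3]))  # ['H', 'H', 'H']
```

With these numbers, lots of ice cream pulls the decoder towards hot days even across a single low-ice-cream day.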
HMMs for speech recognition
● Phones are not homogeneous
● Therefore, each phone is modeled by a three-state (or two-state) HMM
● The states correspond to subphones (the start state, middle state, end state)
● Transition probabilities: how likely it is that the subphone takes a self-loop or goes to the next subphone (sort of a duration model)
● Observation likelihoods: how likely each subphone is to generate a certain feature vector
● That is, instead of discrete elements (number of ice creams), the outputs are 40-dimensional vectors
Acoustic modeling with HMMs
● HMMs are used to model different phonemes
● A pronunciation lexicon maps each word in the language model to a sequence of basic phonemes, e.g.:

  ONE   W AH N
  TWO   T UW
  THREE TH R IY
  ...

● Word HMMs are synthesized from phoneme HMMs based on the pronunciation lexicon
● Decoding: find the path that most likely “produced” the feature vector sequence
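Synthesizing a word HMM from phone HMMs can be sketched as a state-sequence expansion (lexicon entries from the slide; the subphone naming scheme is illustrative):

```python
# Pronunciation lexicon: word -> phone sequence
lexicon = {"ONE": ["W", "AH", "N"],
           "TWO": ["T", "UW"],
           "THREE": ["TH", "R", "IY"]}

def word_states(word):
    """Expand a word into its HMM state sequence: each phone
    contributes begin/middle/end subphone states."""
    return [f"{ph}_{part}" for ph in lexicon[word]
            for part in ("b", "m", "e")]

print(word_states("TWO"))
# ['T_b', 'T_m', 'T_e', 'UW_b', 'UW_m', 'UW_e']
```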
Context-sensitive phones
● Phonemes sound very different, depending on the context
○ Especially at the start and end
○ E.g. /ae/ in the word “sat” is different from /ae/ in the word “man”
● Therefore we model words using triphones (or biphones)
○ A triphone is a “phone in a certain context”
○ “sat” is synthesized from the triphones /sil-s+ae/, /s-ae+t/, /ae-t+sil/
● Each triphone is modeled using three states
● In order to reduce the number of different states, states of the same phoneme in similar contexts are tied together (modeled by the same parameters)
● This gives more training data for each tied state
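The triphone expansion follows the common left-phone - phone + right-phone notation, with "sil" padding at utterance boundaries (a sketch matching the "sat" example):

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones,
    written as left-phone - phone + right-phone."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["s", "ae", "t"]))
# ['sil-s+ae', 's-ae+t', 'ae-t+sil']
```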
Context-sensitive phones, cont.
● Triphone state clustering is usually done using a decision tree
● The decision tree is learned from data
● The resulting tied states are called senones
● Typically, we want to have around 5000 senones in our model (more if we have a lot of training data)
Connecting speech features and HMMs
● In hybrid DNN-HMM speech recognition, the output distribution of HMM states is parametrized by a DNN
● The DNN uses a stack of neighbouring acoustic feature vectors as input
● And calculates the probability over the HMM states as output
DNN in the acoustic model
● The DNN in the acoustic model has a similar role as a DNN in image recognition
● Given (typically 50) neighbouring acoustic feature vectors (an “image”), decide what phoneme state (e.g., beginning of /a/) is in the centre
● The DNN produces a distribution over all phoneme states, given a stack of feature vectors, i.e., P(state | o_{t-k}, …, o_{t+k})
Flipping probabilities
● Since the HMM is a generative model, we need to know the probability that a certain phone state generates a certain feature vector
● However, the DNN is discriminative, meaning it gives us the probability that a certain feature vector corresponds to a certain state
● The Bayes rule helps to transform the latter into the former:
  p(o | state) ∝ P(state | o) / P(state)
● That is, we need to divide the outputs of the DNN by the prior probability of the states (it’s just their frequency in the training data)
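The division by state priors can be sketched on toy numbers (three hypothetical states; the values are invented for illustration):

```python
import numpy as np

# DNN posteriors P(state | features) for one frame, over three toy states
posteriors = np.array([0.7, 0.2, 0.1])
# State priors P(state): the frequency of each state in the training data
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods p(features | state), up to a constant factor
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods)  # [1.4        0.66666667 0.5       ]
```

Note how the most frequent state gets its posterior discounted the least relative to its probability mass; rare states are boosted.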
DNN architectures for speech recognition
● Much of the research in speech recognition is just looking for DNN architectures that work best for this task
● Currently, some of the best working DNN architectures for acoustic modeling are
○ Time-delay neural network (TDNN) = many layers of dilated 1D convolutions
○ The above combined with a (bidirectional) LSTM
■ Update: the LSTM is nowadays often left out
○ Typically around 5-10 hidden layers
Training DNN models for speech recognition
● DNNs used in the acoustic model are similar to the ones used in image recognition
● Training image recognition models is straightforward, as we know the true label of each image in the training data
● For speech, we only know the true transcript of each speech utterance, but we don’t know the true frame-by-frame phoneme sequence
● Therefore, we usually first train an old-school Gaussian Mixture Model based HMM that can produce the alignment between feature vectors and phonemes, given the transcript
● The DNN can now be trained similarly to image recognition
○ Input: a window of acoustic feature vectors (the “image”)
○ Output: the phoneme corresponding to the center of the window (the label)
● Nowadays this objective is often mixed with a more complicated sequence-level criterion
Language model
● The acoustic model gives us the probabilities that the feature vectors were “generated” by a certain sentence
● The language model gives us prior probabilities of different sentences
● Prior probabilities: probabilities of “hearing” a sentence, before the user even starts speaking
Rule-based LM
● The simplest language model is a rule-based model
● A rule-based LM typically assigns equal probabilities to all “valid” sentences, and zero probability to all invalid sentences

E.g., a rule-based LM for controlling a robot:
● (go|move) ((one meter)|((two|three) meters)) (forward|backward)
● turn (right|left)

This allows sentences like “move one meter forward”, but not “move right” or “two three”
Statistical language models
● For natural (unconstrained) speech recognition, N-gram language models are used
● N-gram LMs make the assumption that word probability depends only on the N-1 previous words
● Usually in speech recognition, N=3 or N=4. In case of trigrams (N=3):
  P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
● N-gram probabilities are estimated from a big text corpus, and smoothed
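An unsmoothed trigram estimate from a toy corpus looks like this (real systems use large corpora plus smoothing such as Kneser-Ney):

```python
from collections import Counter

# Toy corpus; real LMs are estimated from billions of words
corpus = "the cat sat on the mat the cat ate".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))  # trigram counts
bi = Counter(zip(corpus, corpus[1:]))               # bigram counts

def p(w, u, v):
    """Maximum-likelihood estimate P(w | u, v) = count(u,v,w) / count(u,v)."""
    return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

print(p("sat", "the", "cat"))  # 0.5: "the cat" is followed by "sat" once, "ate" once
```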
Decoding
● Decoding goal: find the word sequence W that maximizes:
  Ŵ = argmax_W P(O | W) P(W)
● All models are integrated during decoding
○ DNN acoustic model
○ HMM transition probabilities
○ Pronunciation lexicon
○ Word N-gram probabilities
● Heavy pruning is needed to deal with large word vocabularies
○ Pruning drops “hopeless” branches early, in order to save computation
● Algorithmically quite complicated, if efficiency is needed
Recurrent neural network language model
● Nowadays, a recurrent neural network (RNN) language model is often used for improving speech recognition performance
● An RNN is too computationally expensive to be used in the first pass of decoding
● Therefore, an N-gram model is used during decoding
● However, the decoding pass generates a recognition lattice (a compact representation of the most likely recognition hypotheses)
● The RNN is then used for rescoring, i.e., to assign new LM scores to different paths through the lattice
● This reduces errors by 10-20% (relative)
Porting to new domains
● Acoustic model (AM) and language model (LM) can be trained on different data
● This allows easy porting of speech recognition systems to new domains
● For example, the AM is usually trained on transcribed broadcast data, lecture transcripts, and dictated speech
● The LM, however, can be trained on domain-specific texts (no speech is necessary)
● Example: we created a dictation system for the medical domain, using an AM trained on (mostly) broadcast speech, and an LM trained on large amounts of medical texts
Estonian speech recognition
● At Tallinn University of Technology, we have been working on usable and practical Estonian ASR for about 15 years
● Accuracy depends a lot on the domain
○ Spontaneous speech is the hardest, as the language model is mostly trained on written language
○ Noisy environments also introduce errors
● Our system is free to use, open source, and available as a Docker container
● Can be used for producing video subtitles, e.g. https://youtu.be/9YQHNh3zeDs
Domain                                                    WER
Radio talkshows                                           9.1%
Speeches from a linguistics conference                    13.7%
Spontaneous speech (studio-recorded)                      17.6%
Recordings “from the wild” (real meeting and interview
recordings with random devices in noisy environments)     25.0%