
Decoding Techniques for Automatic Speech Recognition

Florian Metze

Interactive Systems Laboratories


Outline

• Decoding in ASR
• Search Problem
• Evaluation Problem
• Viterbi Algorithm
• Tree Search
• Re-Entry
• Recombination


The ASR problem: arg max_W p(W|x)

• Two major knowledge sources:
  – Acoustic Model: p(x|W)
  – Language Model: P(W)

• Bayes: p(W|x) P(x) = p(x|W) P(W)

• Search problem: arg max_W p(x|W) P(W) (see the sketch below)

• p(x|W) consists of Hidden Markov Models:
  – Dictionary defines the state sequence: "hello" = /hh eh l ow/
  – Full model: concatenation of states (i.e. sounds)
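
This decision rule can be written down in a few lines. A minimal sketch, assuming toy stand-in scoring functions (acoustic_logprob, lm_logprob) in place of real HMM and n-gram models, and an explicit hypothesis list instead of a search:

```python
# Toy Bayes-rule decoder over an explicit hypothesis list.
# acoustic_logprob and lm_logprob are made-up stand-ins for log p(x|W) and log P(W).
def acoustic_logprob(x, words):
    return -0.5 * abs(len(x) - 4 * len(words))    # crude length-matching score

def lm_logprob(words):
    return -1.0 * len(words)                      # crude "shorter is likelier" prior

def decode(x, hypotheses):
    """Bayes decision rule: argmax_W p(x|W) P(W), evaluated in log space."""
    return max(hypotheses,
               key=lambda words: acoustic_logprob(x, words) + lm_logprob(words))

x = list(range(9))                                # nine fake acoustic frames
print(decode(x, [("hello",), ("hello", "world"), ("yellow", "word")]))
```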


Target Function / Measure

• %WER = minimum edit distance between reference and hypothesis

• Example (see the sketch below):
  REF:  the  quick  brown  fox  jumps  *   over
  HYP:  *    quick  brown  fox  jump   is  over
  ERR:  D                        S     I
  WER = 3/7 = 43%

• Different measure from max p(W|x)!
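
A minimal sketch of the minimum-edit-distance count behind %WER (a plain word-level Levenshtein distance, assumed here; it yields the same three errors as the slide's example):

```python
def word_edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions (word-level Levenshtein)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitute, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)]

print(word_edit_distance("the quick brown fox jumps over".split(),
                         "quick brown fox jump is over".split()))
# 3 edits: one deletion ("the"), one substitution ("jumps" -> "jump"), one insertion ("is")
```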


A simpler problem: Evaluation

• So far we have:
  – Dictionary: "hello" = /hh eh l ow/ …
  – Acoustic Model: p_hh(x), p_eh(x), p_l(x), p_ow(x) …
  – Language Model: P("hello world")

• State sequence: /hh eh l ow w er l d/ (see the sketch below)

• Given W and x: an alignment is needed!

[Figure: acoustic observations x aligned against the state sequence / hh eh l ow /]
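
A minimal sketch of how the dictionary turns the word string W into the HMM state sequence that then has to be aligned to x (the sub-state expansion is an assumption borrowed from the PPT slides later on):

```python
# Pronunciation dictionary as on the slide
DICT = {"hello": ["hh", "eh", "l", "ow"],
        "world": ["w", "er", "l", "d"]}

def state_sequence(words, expand_substates=False):
    """Concatenate the phones of each word into one HMM state sequence.
    If expand_substates is set, split every phone into the sub-phonemic
    units -b, -m, -e mentioned later for the PPT search."""
    seq = []
    for word in words.split():
        for phone in DICT[word]:
            if expand_substates:
                seq += [phone + "-b", phone + "-m", phone + "-e"]
            else:
                seq.append(phone)
    return seq

print(state_sequence("hello world"))
# ['hh', 'eh', 'l', 'ow', 'w', 'er', 'l', 'd']
```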



The Viterbi Algorithm

• Beam search from left to right

• Resulting alignment is the best match given the state models p(x) and the observations x

[Figure: Viterbi trellis with the states /hh eh l ow/ on one axis and time on the other; each cell holds the local score p(x) for that state and frame]


The Viterbi Algorithm (cont'd)

• Evaluation problem: ~ Dynamic Time Warping

• Best alignment for given W, x, and state models p(x), found by locally adding scores (= -log p) for states and transitions

[Figure: the same trellis with accumulated path scores; the best alignment ends in the final state /ow/ at the last frame]
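
A minimal sketch of this alignment step, assuming toy -log scores in place of real acoustic model outputs and leaving out transition scores and beam pruning:

```python
import math

def viterbi_align(local_scores):
    """Forced alignment with the Viterbi algorithm.
    local_scores[t][s] = -log p_s(x_t) for frame t and state s; the state
    may only stay or advance by one per frame (left-to-right HMM)."""
    T, S = len(local_scores), len(local_scores[0])
    cost = [[math.inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    cost[0][0] = local_scores[0][0]                 # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1][s]
            advance = cost[t - 1][s - 1] if s > 0 else math.inf
            cost[t][s] = local_scores[t][s] + min(stay, advance)
            back[t][s] = s if stay <= advance else s - 1
    path, s = [S - 1], S - 1                        # must end in the last state
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return cost[-1][-1], path[::-1]

# 6 frames against the 4 states /hh eh l ow/; scores are made-up -log values
scores = [[1.0, 2.0, 3.0, 3.0],
          [1.2, 1.0, 2.5, 3.0],
          [2.0, 1.1, 1.3, 2.5],
          [2.5, 1.8, 1.0, 2.0],
          [3.0, 2.2, 1.2, 1.1],
          [3.0, 2.5, 2.0, 1.0]]
print(viterbi_align(scores))   # best total score and the state index per frame
```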


Pronunciation Prefix Trees (PPT)

• Tree representation of the search dictionary
• Very compact → fast!
• Viterbi Algorithm also works for trees

  BROADWAY: B R OA D W EY
  BROADLY:  B R OA D L IE
  BUT:      B AH T
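
A minimal sketch of building such a prefix tree from the pronunciation dictionary (plain nested dicts; word identities stored at the leaves):

```python
# Dictionary entries from the slide
PRONUNCIATIONS = {"BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
                  "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
                  "BUT":      ["B", "AH", "T"]}

def build_ppt(prons):
    """Pronunciation prefix tree: shared prefixes share nodes,
    so /B R OA D/ is stored only once for BROADWAY and BROADLY."""
    root = {}
    for word, phones in prons.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node["<word>"] = word          # mark the leaf with the word identity
    return root

ppt = build_ppt(PRONUNCIATIONS)
print(ppt["B"]["R"]["OA"]["D"].keys())   # dict_keys(['W', 'L']) – the branch point
```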


Viterbi Search for PPTs

• A PPT is traversed in a time-synchronous way
• Apply the Viterbi Algorithm on
  – the state level (sub-phonemic units: -b -m -e), constrained by the HMM topology
  – the phone level, constrained by the PPT
• What do we do when we reach the end of a word?


Re-Entrant PPTs for continuous speech

• Isolated word recognition:
  – Search terminates in the leaves of the PPT

• Decoding of word sequences:
  – Re-enter the PPT and store the Viterbi path using a backpointer table (sketched below)
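
A minimal sketch of such a backpointer table (a hypothetical layout with one entry per recognized word end):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BackPointer:
    """One word-end hypothesis: which word ended, when, with what score,
    and which earlier table entry it continues."""
    word: str
    end_frame: int
    score: float                 # accumulated -log probability up to end_frame
    predecessor: Optional[int]   # index of the previous entry, None at utterance start

table: List[BackPointer] = []

def trace_back(index: int) -> List[str]:
    """Follow the predecessor links to recover the word sequence of a path."""
    words, cur = [], index
    while cur is not None:
        words.append(table[cur].word)
        cur = table[cur].predecessor
    return words[::-1]
```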


Problem: Branching Factor

• Imagine a sequence of 3 words with a 10k vocabulary
  – 10k ^ 3 = 1000G word sequences (potentially)
  – Not everything will be expanded, of course

• Viterbi approximation → path recombination:
  – Given P(Candy | "hi I am") = P(Candy | "hello I am"), the two paths can be merged

[Figure: the paths "hi I am" and "hello I am" recombine before the word "Candy"]


Path Recombination

At time t:
  Path1 = w_1 … w_N with score s_1
  Path2 = v_1 … v_M with score s_2

where
  s_1 = p(x_1 … x_t | w_1 … w_N) · ∏_i P(w_i | w_{i-1} w_{i-2})
  s_2 = p(x_1 … x_t | v_1 … v_M) · ∏_i P(v_i | v_{i-1} v_{i-2})

In the end, we're only interested in the best path!
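
A minimal sketch of path recombination under the Viterbi approximation (the helper and its two-word history are illustrative assumptions for a trigram LM, not the decoder's actual routine):

```python
def recombine(paths, history_length=2):
    """paths: list of (word_sequence, score) with score = -log probability.
    Two paths whose last `history_length` words agree get identical LM scores
    for every continuation, so only the cheaper one needs to survive."""
    best = {}
    for words, score in paths:
        key = tuple(words[-history_length:])        # the trigram-relevant history
        if key not in best or score < best[key][1]:
            best[key] = (words, score)
    return list(best.values())

paths = [(("hi", "I", "am"), 12.3), (("hello", "I", "am"), 11.7)]
print(recombine(paths))   # only ("hello", "I", "am") survives: same history "I am"
```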


Path Recombination (cont'd)

• To expand the search space into a new root:
  – Pick the path with the best score so far (Viterbi approximation)
  – Initialize scores and backpointers for the root node according to the best predecessor word
  – Store the left-context model information with the last phone of the predecessor
    (context-dependent acoustic models: /s ih t/ vs. /l ih p/)


Problem with Re-Entry:

• For a correct use of the Viterbi algorithm, the choice of the best path must include the score for the transition from the predecessor word to the successor word

• The word identity is not known at the root level; the choice of the best predecessor can therefore not be made at this point


Consequences

1. Wrong predecessor words: language model information is only available at the leaf level

2. Wrong word boundaries: the starting point for the successor word is determined without any language model information

3. Incomplete linguistic information: open pruning thresholds are needed for beam search


Three-Pass Search Strategy

1. Search on a tree-organized lexicon (PPT)
   • Aggressive path recombination at word ends
   • Use linguistic information only approximately
   • Generate a list of starting words for each frame

2. Search on a flat-organized lexicon
   • Fix the word segmentation from the first pass
   • Full use of the language model (often needs a third pass)


Three-Pass Decoder: Results

• Q4g system with a cache for acoustic scores:
  – 4000 acoustic models trained on BN+ESST
  – 40k vocabulary
  – Test on "readBN" data

  Search Pass        Error Rate   Real-time factor
  Tree Pass          22.0%        9.6
  Flat Pass          18.8%        0.9
  Lattice Rescoring  15.0%        0.2


One-Pass Decoder: Motivation

• The efficient use of all available knowledge sources as early as possible should result in faster decoding

• Use the same engine to decode along:
  – Statistical n-gram language models with arbitrary n
  – Context-free grammars (CFG)
  – Word graphs


Linguistic states

• Linguistic state, examples:
  – (n-1)-word history for a statistical n-gram LM
  – Grammar state for CFGs
  – (lattice node, word history) for word graphs

• To fully use the linguistic knowledge source, the linguistic state has to be kept during decoding

• Path recombination has to be delayed until the word identity is known
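
A minimal sketch of what such linguistic states can look like as hashable keys (illustrative values and layout only, not the decoder's actual data structures):

```python
# Each knowledge source defines its own notion of "linguistic state"; what matters
# for decoding is that equal states guarantee equal LM scores for every continuation.
trigram_state = ("bullets", "over")          # n-1 word history for a 3-gram LM
cfg_state = ("S -> NP . VP",)                # active grammar rule / dot position for a CFG
lattice_state = (42, ("hello", "world"))     # (lattice node, word history) for a word graph

# Keeping the state during decoding: hypotheses in the same PPT node but with
# different linguistic states must be kept apart; recombination happens per state.
active = {("node_B_R", trigram_state): 11.7,
          ("node_B_R", ("hi", "over")): 12.3}   # both survive until the word is known
print(active)
```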


Linguistic context assignment

• Key idea: establish a linguistic polymorphism for each node of the PPT

• Maintain a list of linguistically morphed instances in each node

• Each instance stores its own backpointer and scores for each state of the underlying HMM with respect to the linguistic state of that instance
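
A minimal sketch of this per-node instance list (the field names are illustrative assumptions, not the decoder's actual structures):

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One linguistically morphed instance of a PPT node."""
    linguistic_state: tuple        # e.g. the n-1 word history this instance belongs to
    state_scores: list             # one accumulated score per HMM state of this node
    backpointer: int               # index into the backpointer table

@dataclass
class PPTNode:
    phone: str
    children: dict = field(default_factory=dict)   # phone -> PPTNode
    instances: list = field(default_factory=list)  # one Instance per linguistic state
```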


PPT with linguistically morphed instances

[Figure: the PPT for BUT, BROADWAY and BROADLY, with a list of linguistically morphed instances attached to each node]

Typically: 3-gram LM, i.e. P(W) = ∏_i P(w_i | W_i),
e.g. P(w_i | W_i) = P(broadway | "bullets over")


Language Model Lookahead

• Since the linguistic state is known, the complete LM information P(W) can be applied to the instances, given the possible successor words for that node of the PPT

• Let
  lct = linguistic context / state of instance i in node n
  path(w) = path of word w in the PPT
  π(n, lct) = min_{w : node n ∈ path(w)} P(w | lct)
  score(i) = p(x_1 … x_t | w_1 … w_N) · P(w_{N-1} | …) · π(n, lct)
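
A minimal sketch of computing these lookahead values π(n, lct) by walking the prefix tree from the earlier PPT example (toy P(w | lct) values for one fixed context; the min follows the slide's definition):

```python
# The prefix tree from the PPT sketch, inlined so this example stands alone,
# plus made-up LM probabilities for one fixed linguistic context lct = ("bullets", "over").
ppt = {"B": {"R": {"OA": {"D": {"W": {"EY": {"<word>": "BROADWAY"}},
                                "L": {"IE": {"<word>": "BROADLY"}}}}},
             "AH": {"T": {"<word>": "BUT"}}}}
LM = {"BROADWAY": 0.20, "BROADLY": 0.05, "BUT": 0.30}

def lookahead(node, lm, results, prefix=()):
    """pi(n, lct) = min over all words w reachable below node n of P(w | lct),
    computed by a post-order walk over the pronunciation prefix tree."""
    values = []
    for phone, child in node.items():
        if phone == "<word>":
            values.append(lm[child])          # a word ends exactly here
        else:
            values.append(lookahead(child, lm, results, prefix + (phone,)))
    results[prefix] = min(values)
    return results[prefix]

results = {}
lookahead(ppt, LM, results)
print(results[("B", "R", "OA", "D")])   # 0.05 – only BROADWAY and BROADLY lie below this node
print(results[()])                      # 0.05 – the whole tree
```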


LM Lookahead (cont'd)

• When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed

• The LM scores are updated on demand, based on a compressed PPT ("smearing" of LM scores)

• Tighter pruning thresholds can be used since the language model information is no longer delayed


Early Path Recombination

• Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of unique linguistic contexts and instances.

• This is particularly effective for cross-word models due to the fan-out in the right-context models.


One-pass Decoder: Summary

• One-pass decoder based on
  – One copy of the tree with dynamically allocated instances
  – Early path recombination
  – Full language model lookahead

• Linguistic knowledge sources
  – Statistical n-grams with n > 3 possible
  – Context-free grammars


Results

             Real-time factor        Error rate
             3-pass    1-pass        3-pass    1-pass
  VM           6.8       4.0         26.9%     26.9%
  readBN      12.2       4.2         14.7%     13.9%
  Meeting     55        38           43.7%     43.4%


Remarks on speed-up

• Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data

• Speed-up depends strongly on matched domain conditions

• The decoder profits from sharp language models

• LM lookahead is less effective for weak language models due to unmatched conditions


Memory usage: Q4g

  Module                 3-pass    1-pass
  Acoustic Models        44 MB     44 MB
  Language Model         87 MB     82 MB
  Overhead               16 MB     16 MB
  Decoder (permanent)    120 MB    18 MB
  Decoder (dynamic)      ~100 MB   ~20 MB
  Total                  367 MB    180 MB


Summary

• Decoding is time- and memory-consuming

• Search errors occur when beams are too tight (trade-off) or the Viterbi assumption is violated

• State of the art: one-pass decoder
  – Tree structure for efficiency
  – Linguistically morphed instances of nodes and leaves

• Other approaches exist (stack decoding, a-posteriori decoding, …)