Robust Speech Recognition Algorithm Against Unknown Short-Time Noise By Arthur Chan Supervised by Prof. Manhung Siu Hong Kong University of Science and

Robust Speech Recognition Algorithm Against Unknown

Short-Time Noise

By Arthur Chan

Supervised by Prof. Manhung Siu

Hong Kong University of Science and Technology

Copyright © by Arthur Chan 2001

Outline
• Robust Speech Recognition
• HMM-based speech recognition in short-time noise
• Our Proposal: Skip the poor frames
  – Theory
  – Implementation: FSVA and FSHMM
• Evaluation I: Gaussian noise replacement
• Improvement of FSVA
• Evaluation II: Further evidence
  – Additive short-time noise
  – Short-time noise in GSM environment
• Conclusion and Future Work

Robust Speech Recognition

Speech Recognition

• Speech recognition
  – Achieves acceptable performance in matched training and testing conditions,
  – or when the operating conditions are known at training time.
  – Digit recognition (99%).
  – Dictation (90%).
  – Performance is still improving where the task is under active research.

Mismatch Conditions

• A mismatch is a difference between the training and the operating (testing) environment.
• It exists.
• For example,
  – Simpler example
    • A sudden door slam while dictating a letter.
  – In a wireless environment,
    • The background of the speaker can change.
• Robust speech recognition is the study of building speech recognizers that handle mismatched conditions.

Mismatch Conditions (cont.)

• Why are mismatch conditions hard to deal with?
  – They have many causes.
    • Additive noise (e.g. background noise such as air-conditioning)
    • Channel noise (e.g. difference between microphones in training and testing conditions)
    • Others: Lombard effect, reflections from buildings.
  – In general, noise can have
    • Random amplitude,
    • Random duration,
    • Random occurrence,
    • Random spectral characteristics.

Conventional Approach of Robust Speech Recognition

• E.g. Parallel Model Combination (PMC) (Gales, 95)
  – First collect some samples of noise in the operating environment,
  – Update the acoustic model using the noise statistics.
• Works satisfactorily for stationary noise.
• General time-varying noise cannot be handled.

Short Time Noise

• Time-limited noise.
• Usually found in the operating environment, such as
  – Door slam,
  – Click sound of a keyboard,
  – Frame loss in network transmission of speech.

Short Time Noise (cont.)

• In this work, we define short-time noise as noise with
  – Random spectral characteristics,
  – Random amplitude,
  – Random occurrence,
  – Random duration,
  – Duration shorter than the speech signal.
• Also known as partial temporal corruption (J. Ming, 2001).
• Some parts of the speech are not corrupted.

This work

• Deals with short-time noise.
• Some parts of the speech are uncorrupted.
• Uses an interesting perspective:
  – Can we ignore the contributions of the corrupted frames in the decision-making process?
  – Supported by Missing Feature Theory (Lippmann 97).
    • We can regard the corrupted parts of speech as missing.
    • We can ignore those missing parts in the decision.

HMM-Based Speech Recognition

Hidden Markov Model (HMM)

• Markov model with an unobservable state sequence.
• Can be used in other pattern recognition tasks.
• Efficient algorithms for training and testing exist.
• Example: left-to-right HMM to model speech.

Viterbi Algorithm

• Efficiently searches for the most likely state sequence that explains all observations:

$$\tilde{Q} = \arg\max_{Q \in Q^*} \log P(Q \mid O)$$

• $O$: an observation sequence $(o_1, o_2, \ldots, o_T)$, also written $O_1^T$.
• $Q$: a state sequence $(q_1, q_2, \ldots, q_T)$, also written $Q_1^T$.
• $Q^*$: the set of all possible state sequences.
• $\tilde{Q}$: the best state sequence.

Viterbi Algorithm (cont.)

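– Expressed in the HMM's parameters,

$$\tilde{Q} = \arg\max_{Q_1^T \in Q^*} \log P(O_1^T \mid Q_1^T)\, P(Q_1^T)
= \arg\max_{Q_1^T \in Q^*} \Big[ \log \pi_{q_1} + \sum_{t=2}^{T} \log a_{q_{t-1} q_t} + \sum_{t=1}^{T} \log b_{q_t}(o_t) \Big]$$

– $\pi_{q_1}$: initial probability
– $a_{q_{t-1} q_t}$: transition probability
– $b_{q_t}(o_t)$: observation probability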

Viterbi Algorithm (cont.)

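[Figure: trellis with states 1, 2, 3 at each time step T=1, T=2, T=3, ...]

• Efficient implementation
  – At each state and at each time, define the partial score

$$\delta_t(i) = \max_{Q_1^{t-1}} P(O_1^t, Q_1^{t-1}, q_t = i)$$

• Recursive formula

$$\delta_t(j) = \Big[ \max_i \delta_{t-1}(i)\, a_{ij} \Big]\, b_j(o_t)$$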
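The recursion can be sketched in a few lines of Python. This is a minimal illustration in the log domain, not the implementation used in this work; the names `viterbi`, `log_pi`, `log_a` and `log_b` are assumptions introduced here (`log_b[t][j]` stands for the log observation probability of frame t in state j).

```python
import math

def viterbi(log_pi, log_a, log_b):
    """Log-domain Viterbi search.

    log_pi[i]   : log initial probability of state i (assumed name)
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log observation probability of frame t in state j
    Returns (best log score, best state sequence).
    """
    n, T = len(log_pi), len(log_b)
    # Partial scores for the first frame.
    delta = [log_pi[j] + log_b[0][j] for j in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, T):
        new_delta, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] + log_a[i][j])
            new_delta.append(delta[best_i] + log_a[best_i][j] + log_b[t][j])
            ptr.append(best_i)
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

# Toy two-state example with uniform initial and transition probabilities.
half = math.log(0.5)
example_score, example_path = viterbi(
    [half, half],
    [[half, half], [half, half]],
    [[math.log(0.9), math.log(0.1)],
     [math.log(0.1), math.log(0.9)],
     [math.log(0.2), math.log(0.8)]],
)
print(example_path, example_score)
```

Working with sums of logs rather than products of probabilities avoids numerical underflow on long utterances.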

Short-Time Noise in Viterbi Algorithm

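• Finding the best state sequence maximizes a sum over all frames,

$$\log P(O_1^T \mid Q_1^T)\, P(Q_1^T) = \sum_{t=1}^{T} \log a_{q_{t-1} q_t} + \sum_{t=1}^{T} \log b_{q_t}(o_t)$$

  (here $a_{q_0 q_1}$ denotes the initial probability of $q_1$)

• Compare with finding the mean using the average,

$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

  – E.g. mean of 2.2, 2.3, 2.4, 2.2 = 2.275
  – Mean of 2.2, 2.3, 2.4, 100 = 26.725
• Both are easily affected by outlier frames.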

Our Proposal

Our Proposal

• Search for the most likely state sequence while ignoring the K most poorly performing frames.
• Can be implemented efficiently
  – similar to the Viterbi algorithm.
• Achieves satisfactory performance.

Robust Mean of 2.2, 2.3, 2.4,100

=(2.2+2.3+2.4)/3=2.3
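The averaging analogy is easy to check numerically. The sketch below is only an illustration of the idea; the helper names `mean` and `robust_mean` are made up for this example.

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)

def robust_mean(values, k):
    """Mean after discarding the k largest values (the presumed outliers)."""
    kept = sorted(values)[:len(values) - k]
    return sum(kept) / len(kept)

clean = [2.2, 2.3, 2.4, 2.2]
corrupted = [2.2, 2.3, 2.4, 100]

print(mean(clean))                # ordinary mean of clean data
print(mean(corrupted))            # one outlier dominates the result
print(robust_mean(corrupted, 1))  # dropping the outlier restores a sensible value
```

In the averaging example the outlier is the largest value; in recognition the analogous outliers are the frames with the lowest likelihood.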

Formulation : Ignore the poorest frame

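• Try to ignore the frame with the lowest likelihood, i.e.

$$\tilde{Q} = \arg\max_{Q \in Q^*} \prod_{t=1}^{T} a_{q_{t-1} q_t} \prod_{\substack{t=1 \\ t \neq t_1}}^{T} b_{q_t}(o_t)$$

• Here the frames $(o_1, \ldots, o_T)$ have been rank-ordered as $(o_{t_1}, \ldots, o_{t_T})$
• such that $b_{q_{t_i}}(o_{t_i}) \le b_{q_{t_{i+1}}}(o_{t_{i+1}})$, so $o_{t_1}$ is the frame with the lowest likelihood.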

Generalization : Ignore the poorest K frames

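• The robust likelihood $\Phi_K$ is defined by skipping the $K$ frames with the lowest likelihood, indexed $t_1, \ldots, t_K$ in the rank ordering:

$$\Phi_K(O_1^T \mid Q_1^T) = \sum_{i=K+1}^{T} \log b_{q_{t_i}}(o_{t_i}) + \sum_{i=1}^{T} \log a_{q_{i-1} q_i}
= \sum_{\substack{i=1 \\ i \notin \{t_1, \ldots, t_K\}}}^{T} \log b_{q_i}(o_i) + \sum_{i=1}^{T} \log a_{q_{i-1} q_i}$$

– Still, we maintain the alignment information (the transition term is unchanged).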

Generalization :

• Speech recognition becomes the problem of finding the state sequence with the best robust likelihood,

$$\tilde{Q}_1^T = \arg\max_{Q_1^T \in Q^*} \Phi_K(O_1^T \mid Q_1^T)$$

Alternative Formulation
• For every state sequence, consider all possible patterns of corruption of K frames among T frames.
• There are $C_K^T$ of them in total. Denote them as $(l_1, l_2, \ldots, l_{C_K^T})$.
• For each pattern, $l_i = (o_{i_1}, o_{i_2}, \ldots, o_{i_{T-K}})$ is the set of uncorrupted frames in this pattern.
• E.g. T=4, K=2 has the following patterns of corruption:
  – Frames 1 and 2,
  – Frames 1 and 3,
  – Frames 1 and 4,
  – Frames 2 and 3,
  – Frames 2 and 4,
  – Frames 3 and 4.

Alternative Formulation

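• The robust likelihood can alternatively be defined as

$$\Phi_K(O_1^T \mid Q_1^T) = \max_{i=1}^{C_K^T} \log p(l_i \mid Q_1^T) + \sum_{j=1}^{T} \log a_{q_{j-1} q_j}
= \max_{i=1}^{C_K^T} \sum_{j=1}^{T-K} \log p(o_{i_j} \mid q_{i_j}) + \sum_{j=1}^{T} \log a_{q_{j-1} q_j}$$

• Extended Union Model (EUM) probability (J. Ming): the maximum over patterns is replaced by a sum,

$$\Phi_K^{\mathrm{EUM}}(O_1^T \mid Q_1^T) = \log \sum_{i=1}^{C_K^T} p(l_i \mid Q_1^T) + \sum_{j=1}^{T} \log a_{q_{j-1} q_j}$$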
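For a fixed state sequence, the maximum over corruption patterns and the "drop the K lowest frames" definition give the same observation score. The sketch below checks this by brute force on made-up per-frame log-likelihoods; the function names and the numbers are illustrative assumptions, and the transition term is left out because it is identical in both forms.

```python
from itertools import combinations

def robust_obs_score_sorted(frame_logp, k):
    """Sum of per-frame log-likelihoods after dropping the k lowest."""
    return sum(sorted(frame_logp)[k:])

def robust_obs_score_patterns(frame_logp, k):
    """Maximum, over all C(T, k) corruption patterns, of the kept-frame sum."""
    T = len(frame_logp)
    best = float("-inf")
    for skipped in combinations(range(T), k):
        kept = sum(lp for t, lp in enumerate(frame_logp) if t not in skipped)
        best = max(best, kept)
    return best

# The six corruption patterns for T=4, K=2, using 1-based frame numbers.
patterns = [tuple(t + 1 for t in p) for p in combinations(range(4), 2)]
print(patterns)

# Hypothetical log-likelihoods along one state sequence; frames 2 and 4 are poor.
frame_logp = [-2.0, -50.0, -3.0, -40.0]
print(robust_obs_score_sorted(frame_logp, 2))
print(robust_obs_score_patterns(frame_logp, 2))
```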

Missing Feature Theory Interpretation

• The above formulation relates to Missing Feature Theory, which suggests:
  – If a feature is corrupted, we can just ignore it.
  – Example: multi-band ASR assumes band-limited noise (frequency limited).
  – Similarly, our idea assumes that noises are short-time in nature (time limited).

Direct Implementation

• Exhaustively neglect K frames for every state sequence
  – Very expensive,
  – For each state sequence, $C_K^T (T-K)$ additions are required,
  – Intractable for useful values of T and K.
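To get a feel for the growth, the count of additions per state sequence can be evaluated directly. This is a small illustrative helper (the name `direct_additions` is made up here), using the formula stated above.

```python
from math import comb

def direct_additions(T, K):
    """Additions per state sequence for the exhaustive search: C(T, K) * (T - K)."""
    return comb(T, K) * (T - K)

# The count grows combinatorially with the utterance length T and skip count K.
for T, K in [(4, 2), (20, 5), (100, 10)]:
    print(T, K, direct_additions(T, K))
```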

Previous Attempts to tackle the Computation Burden

• Let's look at an attempt that deals with EUM
• J. Ming et al. (2001)
  – N-best re-scoring paradigm.
  – An approximate model based on segments (a number of consecutive frames) is used.
  – Corruption in a few frames is also regarded as corruption of the whole segment.
• A more efficient algorithm is desirable.

Efficient Implementation of a Viterbi Algorithm that Skips Frames
• Two approaches
  – Topological-space expansion approach
    • using FSHMM.
    • using terminology similar to HMM.
  – State-space expansion approach
    • Modify the Viterbi algorithm directly.

Topological Space Expansion

• Frame-Skipping HMM (FSHMM)
• Skipping state
  – Consumes one observation vector.
  – Generates a constant only.
  – Example:

[Figure: non-skipping version with a single state 1; frame-skipping version with state 1 and its skip state 1_s]

[Figure: left-to-right HMM (FS version), with non-skip states and skip states marked]

Implementation of Topological Space Expansion Approach

• Memory usage is (2N+1) times that of the Viterbi algorithm.
• Can be implemented with standard HMM software (e.g. HTK).
• Hard to generalize to continuous word recognition
  – A huge HMM needs to be constructed.

State-Space Expansion approach

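• The general idea
  – Keep K additional scores per state when skipping up to K frames.
  – In updates coming from the previous skip level, we ignore the contribution of the observation probability.
  – E.g.

[Figure: non-skipping version with states 1, 2, 3; skipping version with states 1_0, 1_1, 2_0, 2_1, 3_0, 3_1. Arcs are labeled $a_{ij}$ (frame skipped) and $a_{ij} b_j(o_t)$ (frame kept).]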

Update Formula

• We can prove a recursion for the partial robust likelihood.
• Define the partial score (robust likelihood) of state j at time t with k skips as

$$\delta_t(j,k) = \max\big(\delta_t^{\text{skip}}(j,k),\ \delta_t^{\text{non-skip}}(j,k)\big)
= \max_i \Big[ a_{ij} \max\big(\delta_{t-1}(i,k-1),\ \delta_{t-1}(i,k)\, b_j(o_t)\big) \Big]$$
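The update formula can be sketched in the log domain as follows. This is an illustrative reading of the recursion, not the code used in this work: the function and variable names are assumptions, `delta[j][k]` tracks paths that have skipped exactly k frames so far, and K is assumed not to exceed the number of frames. A brute-force search over all state sequences is included only to check the recursion on a toy model.

```python
import math
from itertools import product

NEG_INF = float("-inf")

def fsva_scores(log_pi, log_a, log_b, K):
    """Best robust log score with exactly k skipped frames, for k = 0..K."""
    n, T = len(log_pi), len(log_b)
    # delta[j][k]: best partial score ending in state j with k frames skipped.
    delta = [[NEG_INF] * (K + 1) for _ in range(n)]
    for j in range(n):
        delta[j][0] = log_pi[j] + log_b[0][j]  # first frame kept
        if K >= 1:
            delta[j][1] = log_pi[j]            # first frame skipped
    for t in range(1, T):
        new = [[NEG_INF] * (K + 1) for _ in range(n)]
        for j in range(n):
            for k in range(K + 1):
                best = NEG_INF
                for i in range(n):
                    keep = delta[i][k] + log_b[t][j]              # frame t kept
                    skip = delta[i][k - 1] if k >= 1 else NEG_INF  # frame t skipped
                    best = max(best, log_a[i][j] + max(keep, skip))
                new[j][k] = best
        delta = new
    return [max(delta[j][k] for j in range(n)) for k in range(K + 1)]

def brute_force_scores(log_pi, log_a, log_b, K):
    """Reference: enumerate every state sequence and drop its k lowest frames."""
    n, T = len(log_pi), len(log_b)
    best = [NEG_INF] * (K + 1)
    for q in product(range(n), repeat=T):
        trans = log_pi[q[0]] + sum(log_a[q[t - 1]][q[t]] for t in range(1, T))
        obs = sorted(log_b[t][q[t]] for t in range(T))
        for k in range(K + 1):
            best[k] = max(best[k], trans + sum(obs[k:]))
    return best

# Toy model: the second frame is badly matched in every state.
lg = math.log
toy_pi = [lg(0.6), lg(0.4)]
toy_a = [[lg(0.7), lg(0.3)], [lg(0.4), lg(0.6)]]
toy_b = [[lg(0.5), lg(0.1)], [lg(1e-6), lg(2e-6)],
         [lg(0.2), lg(0.7)], [lg(0.3), lg(0.4)]]
print(fsva_scores(toy_pi, toy_a, toy_b, 2))
print(brute_force_scores(toy_pi, toy_a, toy_b, 2))
```

The dynamic program visits each (state, skip level) pair once per frame, whereas the brute-force reference grows exponentially with the number of frames.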

Proof of Update Formula

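$$\Phi_k(O_1^t \mid Q_1^t) = \max_{i=1}^{C_k^t} p(l_i \mid Q_1^t)
= \max\Big( \max_{l \in L_1} p(l \mid Q_1^t),\ \max_{l \in L_2} p(l \mid Q_1^t) \Big)$$

– $L_1$ is the set of corruption patterns in which the t-th frame is skipped.
– $L_2$ is the set of corruption patterns in which the t-th frame is not skipped.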

Proof of Update Formula (cont.)

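$$\Phi_k(O_1^t \mid Q_1^t) = \max\Big( \max_{l \in L_1} p(l \mid Q_1^t),\ \max_{l \in L_2} p(l \mid Q_1^t) \Big)
= \max\Big( \Phi_{k-1}(O_1^{t-1} \mid Q_1^{t-1}),\ \Phi_k(O_1^{t-1} \mid Q_1^{t-1})\, b_{q_t}(o_t) \Big)$$

– If we check the cardinality (or size) of the two sets ($L_1$: patterns that skip frame t, $L_2$: patterns that keep it):

$$|L_1| = C_{k-1}^{t-1}, \quad |L_2| = C_k^{t-1}, \quad L_1 \cap L_2 = \emptyset$$

$$|L_1 \cup L_2| = |L_1| + |L_2| = C_{k-1}^{t-1} + C_k^{t-1} = C_k^t \quad \text{(Pascal's formula)}$$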
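As a quick sanity check of the counting argument, take the T=4, K=2 case enumerated earlier (an illustrative instance, not part of the original proof). The patterns that skip frame 4 are (1,4), (2,4), (3,4) and those that keep it are (1,2), (1,3), (2,3), so

$$|L_1| = C_1^3 = 3, \qquad |L_2| = C_2^3 = 3, \qquad |L_1| + |L_2| = 6 = C_2^4 ,$$

which matches the six corruption patterns listed for T=4, K=2.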

Frame Skipping Viterbi Algorithm (FSVA)

• The transition probability can easily be incorporated in the above formula.

• The resulting update formula is called FSVA.

• A similar idea can be used to compute the probability of the extended union model (EUM).

FSVA (cont.)

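• Update formula

$$\delta_t(j,k) = \max\big(\delta_t^{s}(j,k),\ \delta_t^{n}(j,k)\big)
= \max_i \Big[ a_{ij} \max\big(\delta_{t-1}(i,k-1),\ \delta_{t-1}(i,k)\, b_j(o_t)\big) \Big]$$

  – $\delta_{t-1}(i,k)\, b_j(o_t)$: updated from skip level k (current frame kept).
  – $\delta_{t-1}(i,k-1)$: updated from skip level k-1 (current frame skipped).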

Implementation II (State-Space expansion approach)

• Similar to the exact N-best algorithm.

• Memory usage: N times that of the normal Viterbi algorithm.

• With caching of observation probabilities, the computation is quite similar to that of the normal Viterbi algorithm.

Evaluation I:Gaussian Noise Replacement

Evaluation I(Objective)

• To determine the usefulness of FSVA.

Evaluation I(Conditions)

• Baseline
  – Corpus: TIDIGITS (adults), train 8668, test 8668
  – Training: 12 MFCCs + delta + delta-delta + energy = 39 features
  – Testing results
    • 99.72 (Isolated Digit Recognition),
    • 98.90 (Connected Digit Recognition) (untuned)

Evaluation I(Conditions)(cont.)

• Corruption is simulated
  – 10% of the frames in each testing utterance are removed and replaced by a frame of
    • Gaussian noise
    • at a constant energy level.
  – A clean model is used for testing.
  – Testing results using a left-to-right HMM
    • 85.34% (Isolated Digit Recognition),
    • 78.83% (Continuous Digit Recognition)
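The corruption procedure can be sketched as below. This is only a rough illustration under stated assumptions: it operates on a list of per-frame value vectors, uses a fixed standard deviation `sigma` to stand in for the constant energy level, and picks the corrupted positions uniformly at random. The function name and parameters are made up for the example and do not describe the actual experimental scripts.

```python
import random

def replace_frames_with_noise(frames, rate, sigma, seed=0):
    """Replace a fraction `rate` of frames with zero-mean Gaussian noise.

    frames : list of equal-length lists of floats (one list per frame)
    sigma  : standard deviation of the noise (assumed constant energy level)
    Returns (corrupted frames, sorted indices of the replaced frames).
    """
    rng = random.Random(seed)
    T = len(frames)
    n_corrupt = round(rate * T)
    corrupted_idx = set(rng.sample(range(T), n_corrupt))
    out = []
    for t, frame in enumerate(frames):
        if t in corrupted_idx:
            out.append([rng.gauss(0.0, sigma) for _ in frame])
        else:
            out.append(list(frame))
    return out, sorted(corrupted_idx)

# 20 dummy frames of dimension 3, with 10% of them replaced.
dummy = [[0.0, 0.0, 0.0] for _ in range(20)]
noisy, replaced = replace_frames_with_noise(dummy, rate=0.1, sigma=1.0)
print(replaced)
```

Returning the replaced indices makes it possible to compare the frames an algorithm skips against the frames that were actually corrupted.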

Experiment I(Results)

• Using FSVA

             Acc     Skip
CDR Clean    98.97   2
CDR Noisy    93.71   28
IDR Clean    98.47   20
IDR Noisy    99.76   2

[Figure: accuracy (70 to 100) versus number of skips (1 to 55) for CDR noisy (0.1), CDR clean, IDR noisy (0.1) and IDR clean. Annotations: IDR: +88%, CDR: +70%]

But: we are not happy!
  - Accuracy degrades on clean speech.
  - It is hard to determine the best number of skips if the condition is unknown.

How many corrupted frames are skipped? An Analysis

• Define
  – $A$: all frames.
  – $C$: set of corrupted frames.
  – $U = A / C$: set of uncorrupted frames.
  – $H$: set of detected frames, or hit frames.
• Then the likelihood ratio is found to be

$$\frac{P(H \mid C)}{P(H \mid U)} \approx 10$$

• We skip mostly corrupted frames.

How much can be gained from FSVA? – 2nd Analysis

• Performance of FSVA using, for each sentence, the number of skips that gives the lowest WER
  – 99.72 (Isolated), 97.66 (Continuous)
• Still room for improvement
  – Longer sentences require more skips to recover
• E.g. (observed from data)

111.wav
  - SIL 1 1 SIL (from skip 1 to 5)
  - SIL 1 1 1 SIL (from skip 6 to 29)
  - SIL 3 1 1 SIL (from skip 29 to 57)
  ….

24z982z.wav
  - SIL 2 z o 9 8 2 o SIL (from skip 1 to 4)
  - SIL 2 4 z o 9 8 2 o SIL (from skip 5 to 22)
  - SIL 2 4 z o 9 8 2 z o SIL (from skip 23 to 36)
  - SIL 2 4 z o 9 8 2 z SIL (from skip 37 to 57)
  ….

Observations from Evaluation I

• It is difficult to determine the number of skips because of two factors:
  – The condition (rate of corruption) is unknown.
  – The length of the sentence is unknown.
• Memory issue: N times that of the standard Viterbi algorithm.

Improvements of FSVA

Improvements of FSVA :

• We present solutions to the skip determination problem:
  – Skip determination
    • An automatic skip determination mechanism is presented.
  – The memory problem is related to skip determination
    • An approximate algorithm is presented.
    • Preliminary results are presented.

Improvements of FSVA:Automatic Skip Determination

• This is a hard problem; it depends on
  – Length of utterance
  – Rate of corruption
• With known corruption rate and length of corruption
  – skipping a fixed number of frames may be the most intuitive.
• In general, these conditions are unknown
  – Ideally, we seek a method that requires no prior knowledge of the environment.


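Improvements of FSVA: Automatic Skip Determination (cont.)
• Idea: Log Likelihood Ratio Thresholding (LLRT)
  – Stop the skipping process by testing the ratio of likelihoods.
• Why does it work?
  – In general, the robust likelihood increases with K,
  – because each additional skip removes one more frame's contribution from the criterion function:

$$\Phi_{K-1}(O \mid \tilde{Q}(K-1)) \le \Phi_K(O \mid \tilde{Q}(K)) \le \Phi_{K+1}(O \mid \tilde{Q}(K+1))$$

  where $\tilde{Q}(K)$ is the best state sequence found with K skips.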


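Improvements of FSVA: Automatic Skip Determination (cont.)
• The improvement from each additional skip
  – is measured by a likelihood ratio,
  – which is generally decreasing.
• It suggests we can stop skipping if the ratio > a certain threshold c.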

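Improvements of FSVA: Automatic Skip Determination (cont.)
• Can be done very efficiently
  – We can easily generate multiple solutions.

[Figure: non-skipping and skipping versions of the trellis, with the point marked "Start backtracking here"]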

Evaluation of LLRT in gaussian noise replacement

• It works.
  – Undegraded in clean condition
  – Improved in noisy condition
  – A single value works for all conditions, e.g. c=90

         BL      LLRT
Clean    98.90   98.98
Noisy    78.33   95.61

Discussion

• In LLRT, the threshold c
  – Effectively means the minimum likelihood of the clean frames.
  – Success in LLRT suggests
    • Skipping frames with likelihood smaller than c.
    • Simplified Frame-Skipping Viterbi Algorithm (SFSVA)
    • The update formula can be expressed as

$$\hat{b}(o_t) = \begin{cases} c, & \text{if } b(o_t) < c \\ b(o_t), & \text{else.} \end{cases}$$
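The flooring step is simple enough to show directly. The sketch below assumes the scores are kept as per-frame, per-state log-likelihoods and that the floor is given in the same log domain; the name `floor_log_likelihoods` and the mapping from the threshold c to the hypothetical value `log_floor` are assumptions made for illustration.

```python
def floor_log_likelihoods(log_b, log_floor):
    """Clamp every observation log-likelihood from below at log_floor.

    log_b[t][j] : log observation probability of frame t in state j
    log_floor   : hypothetical floor value in the log domain
    """
    return [[max(lp, log_floor) for lp in frame] for frame in log_b]

# Two frames, two states: the very poor scores are raised to the floor.
scores = [[-1.0, -200.0], [-95.0, -3.0]]
floored = floor_log_likelihoods(scores, -90.0)
print(floored)
```

The floored scores can then be passed to an unmodified Viterbi decoder, which is why this variant needs no more memory than the standard search.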

Simplified FSVA : Preliminary Evaluation

• At c=90

         BL      FSVA+LLRT   SFSVA
Clean    98.90   98.98       98.86
Noisy    78.33   95.61       95.61

• Performance comparable to FSVA+LLRT.

Evaluation II : Further Evidences

Evaluation II

• Previous experiment in Evaluation I
  – Fixed spectral content (Gaussian noise)
  – Fixed amplitude
  – Fixed duration (1 frame)
  – Replacement noise
  – Not general enough.
• Experiment 1: additive short-time noise
  – With varying spectral content, amplitude, duration and occurrence.
• Experiment 2: GSM environment (replacement noise)
  – Replacement with comfort noise
  – Similar to speech in this case.

Experiment 1 (Setup)

• Training set is the same as in Evaluation I.
• Additive short-time noise.
  – Frames are picked at random from 7 types of noise, such as ring-tone and ICQ message sounds.
  – Controlled by 3 factors:
    • Amplitude (SNR),
    • Duration (L),
    • Rate of corruption (C).
• FSVA + LLRT is used in the evaluation.

Experiment 1 (Results)

• Changing amplitudes, C=20%, L=1

SNR (dB)      BL      LLRT (opt.)
(unlabeled)   98.90   98.99 (102)
10            98.62   98.67 (106)
0             97.57   97.99 (102)
-10           84.04   91.89 (94)

Experiment 1 (Results) (cont.)

• Changing rate of corruption, SNR=-10dB, L=1

Rate    BL      LLRT (opt.)
20%     84.04   91.89 (94)
30%     69.21   82.96 (94)
40%     56.39   71.15 (94)


• Changing length of corruption, SNR=-10dB, C=20%

Length   BL      LLRT (opt.)
1        84.04   91.89 (94)
2        87.33   91.99 (100)
3        90.24   94.07 (94)
4        93.30   96.22 (92)
5        95.17   97.09 (94)
6        95.69   97.31 (92)


• Average performance.
• Outperforms the baseline over a wide range of c.
• For c in [90,100]
  – Close to optimal performance.

Experiment 1 (Summary of Results)

• FSVA + LLRT works in all conditions:
  – Undegraded result for SNR > 0dB
  – Outperforms the Viterbi algorithm in the other cases
• Is it necessary to use the optimal threshold?
  – No.
  – A large range of values of c outperforms the Viterbi algorithm.
  – A large range of values of c can be used such that
    • the result is close to optimal,
    • tuning is needed in a single condition only.

Experiment 2 (Comfort Noise Generation)

• GSM codec (GSM 06.10)
  – Regular Pulse Excited – Long Term Prediction (RPE-LTP)
  – Linear predictive analysis and synthesis
• Residual coefficients are important.
• Comfort Noise Generation (GSM 06.11)
  – 1st frame: replaced by the last good frame
  – 2nd to 16th frame: decrease the magnitude of the residual coefficients of the 1st frame
  – Frames after the 16th: a predefined “silence” frame is substituted
• The generator cannot deal with frame loss of long duration.

Experiment 2 (Setup)

• Using the AURORA database.
  – Down-sampled version of TIDIGITS.
  – 8008 training utterances.
  – 4004 testing utterances.
• Baseline result
  – Training on GSM-coded speech, testing on GSM-coded speech: 98.64% (<98.90%)

Experiment 2 (Frame Loss Condition)

• Experiment in noisy condition
  – 1%~2% of frames are corrupted.
  – All skip positions are known to the comfort noise generator.
  – Comfort noise generation is done before speech recognition.
  – 2 factors are controlled:
    • Rate of corruption
    • Length of corruption

Experiment 2 (Results)

D: length of corruption, C: rate of corruption

D    C      BL      LLRT (opt.)
1    1%     98.03   98.12 (104)
2    1%     96.47   97.40 (104)
3    1%     96.10   97.19 (98)
1    2%     97.71   97.95 (106)
1    5%     96.31   97.20 (98)
1    10%    92.98   95.33 (98)

Experiment 2 (Average Performance)

Experiment 2 (Summary of Results)

• Corruptions of 1 frame can be handled by the comfort noise generator.

• FSVA is still of value
  – When the length of corruption > 1
  – When the rate of corruption increases
  – After all, there is no degradation even at D=1.

Conclusion and Future Work

Contribution of this work
• FSVA – Frame Skipping Viterbi Algorithm
  – found to be theoretically interesting
  – can be easily and efficiently implemented
  – good results in simulated noise
• The search technique can be applied to fast computation of the Extended Union Model (EUM).
• LLRT – Log Likelihood Ratio Thresholding
  – Automatically determines the number of skips for FSVA.
• Preliminary study of SFSVA – simplified FSVA
  – Same amount of memory as the Viterbi algorithm
  – Improvement comparable to FSVA + LLRT

Impact of this work

• HMMs have a wide range of applications in pattern recognition and digital communication.
  – FSVA can be used to deal with time-limited (or space-limited) corruption in these applications.

Future Work
• Other possibilities implied by MFT.
  – Don’t ignore, but impute.
  – When should we ignore a frame? When should we impute it?
• Combination of FSVA and model-compensation techniques
  – Deal with general additive noise
• Automatic skip determination: any other combination schemes?
  – E.g. ROVER with confidence and voting?
• Evaluation with the comfort noise generators of other codecs.
  – E.g. Voice over IP (VoIP)
• Extend FSVA to other applications that use HMMs.

Thanks for your patience !

Q & A

3, Have you tried your algorithm in Aurora?

• Yes! We tried it on AURORA II.
• But FSVA doesn’t work there because
  – Most of the noises are additive noise
    • E.g. street noise
    • E.g. babble noise
  – The database is designed for feature extraction.

4, It is hard to get clean speech corpus. How do you solve this?

• Our paradigm assumes
  – Training on clean speech
  – Testing on noisy speech
• A complementary method (not yet successful)
  – Training on noisy speech
  – Testing on clean speech
  – Difficult because the multiple-mixture paradigm is hard to beat.

5, Can we incorporate burst corruption in FSVA?

• It is possible but not elegant.

[Figure: FSHMM topology with a burst skip state and a skip state]

6, Relation between Noise Composition?

• Not yet thoroughly understood.

• Decompose FSHMM will result

7, How about Null Node?

• This is a little bit tricky.
• A skip state is a real state.
• A null state cannot result in the skipping of a frame,
  – because no frame is consumed!

[Figure: skipping version with states 1_0, 1_1, 2_0, 2_1, 3_0, 3_1 and null nodes]

8, Have you considered any real-life examples of additive noise?

• Yes!
• Not presented in the thesis or the presentation.
• We have tested on machine gun noise in NOISEX-92.
• Results: 0.7% absolute gain or no gain.
• Cause: machine gun noise in NOISEX-92 corrupts all speech frames.
  – Better to regard it as additive noise.
  – The recording was made while the gun was fired continuously for several minutes. (Can this be real?)
  – A positive result was obtained if the additive noise component is removed.
  – Not reported because it may not be easily accepted by the community.

10, Examples of extending this idea to other applications?
• Yes.

11, How can this idea be used in convolutional coding?

12, What is your plan for combining other techniques with FSVA?

13, Do you really think short-time noise should always be shorter than speech?
• There is an intrinsic difficulty in defining short-time noise.
• Dictionaries of technology always characterize short-time noise as having
  – Random spectral content,
  – Random amplitude,
  – Random occurrence.
• There is no characterization in terms of length.
• The length of speech may be the basic norm for the length of noise.

14, How do you compare this with other similar techniques?
• As we have mentioned,
  – There is another technique called EUM search.

15, Actually, what makes FSVA work?

• Sorry!

• This is a problem we do not thoroughly understand.

• We obtained some strange results.

• Hypothesis: partially corrupted frames.

17, Why do you keep the transition probability in your formulation?
• In theory, we can also ignore the transition contribution. However,
  – Changing the transition means breaking the word apart.
  – It would be disastrous if a phone is deleted or distorted.

Topological Expansion of SFSHMM
