1 Broadcast News Segmentation using Metadata and Speech-To-Text Information to Improve Speech Recognition Sebastien Coquoz, Swiss Federal Institute of

1

Broadcast News Segmentation using Metadata and Speech-To-Text

Information to Improve Speech Recognition

Sebastien Coquoz,

Swiss Federal Institute of Technology (EPFL)

International Computer Science Institute (ICSI)

March 16, 2004

2

Outline

• General idea

• ASR system used

• Exploratory work

• Strategies

• Results

• Conclusion

3

General idea

Use Metadata (SUs) and Speech-To-Text (STT) information to improve later STT passes (feedback loop)

4

Why segmentation?

Why segment the audio stream?

• Important to give « linguistically coherent » pieces to the language model

• Remove « non-speech » (i.e. long silences, laughs, music, other noises,…)

Why use MDE?

• MDE gives information about sentence and speaker breaks

• Speaker labels improve the efficiency of the acoustic model and sentences improve the efficiency of the language model

• BBN’s error analysis of Broadcast News recognition revealed a higher error rate at segment boundaries; this may be caused by missing the true sentence boundaries

5

Metadata and STT information

MDE object used:

Sentence-like units (SUs): an SU expresses a thought or idea and generally corresponds to a sentence. Each SU has a confidence measure, timing information (starting point and duration) and a cluster label.

STT object used:

Lexemes: describe the words that were assumed to be uttered. Each word has timing information (beginning and duration).
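As a rough illustration of the two objects described above, the sketch below models an SU and a lexeme as small Python records. The class and field names (`SU`, `Lexeme`, `start`, `cluster`, and so on) and the `words_in_su` helper are hypothetical; they mirror only the attributes listed on this slide and are not the actual MDE or STT file format.

```python
from dataclasses import dataclass


@dataclass
class SU:
    """Sentence-like unit (field names are assumptions for illustration)."""
    start: float       # starting point, in seconds
    duration: float    # duration, in seconds
    confidence: float  # confidence measure of the SU
    cluster: str       # cluster label

    @property
    def end(self) -> float:
        return self.start + self.duration


@dataclass
class Lexeme:
    """A hypothesized word with its timing information."""
    word: str
    start: float     # beginning, in seconds
    duration: float  # duration, in seconds


def words_in_su(su: SU, lexemes: list[Lexeme]) -> list[Lexeme]:
    """Return the lexemes whose temporal midpoint falls inside the SU."""
    return [w for w in lexemes
            if su.start <= w.start + w.duration / 2 < su.end]
```

Assigning words by midpoint is one simple convention among several; it avoids counting a word twice when it straddles an SU boundary.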

6

ASR system used

The system used is a simplified SRI BN evaluation system.

Recognition steps:

1. Segment the waveforms

2. Cluster the segments into « pseudo-speakers »

3. Compute and normalize features (Mel cepstrum)

4. Do first pass recognition with non-crossword acoustic models and bigram language model

5. Generate lattices

6. Expand lattices using 5-gram language model

7. Adapt acoustic models for each « pseudo-speaker »

8. Generate new lattices using the adapted acoustic models

9. Expand new lattices using 5-gram language model

10. Score the resulting hypotheses

7

Types of segmentation

Baseline

• Classifies frames into « speech » and « non-speech » using a 2-state HMM

• Uses inter-word silences and speaker turns to segment the BN shows

MDE-based

• Uses sentence and speaker breaks to define an initial segmentation

• Further processes the segments using different strategies presented later

Baseline vs. MDE-based segmentation

8

Baseline experiments

Comments:

• The baseline segmentation is the one presented above

• The results (shown later) obtained are:

• the current best results

• the baselines that ultimately have to be improved

• No additional processing step is applied to modify the segments

9

« Cheating » experiments (1)

Why?

• See if there is room for improvement when using MDE-based segmentation

How?

• Use transcripts written by humans to segment the Broadcast News audio stream and apply processing strategies to improve recognition (i.e. use true information)

10

« Cheating » experiments (2)

Results: Baseline vs. « Cheating » experiments

WER (%)               Baseline seg   Cheating seg (using SU)   Cheating seg (SU+proc)
Wtd avg on 6 shows    14.0           14.2                      13.0

There is room for improvement!

11

Overview of the processing steps

Broadcast News Shows

0. Segmentation using SUs

1. First strategy: splitting of long segments

2. Second strategy: concatenation of short segments

3. Third strategy: addition of time pads

Final segmentation

12

First strategy: splitting of long segments

Why?

• Segments that are too long may cover more than one sentence, which is confusing for the language model

How?

• Use automatically generated transcripts and MDE

• Segments that are too short must not be processed: splitting them would hurt the efficiency of the language model

• Take two features into account for decision tree:

• The duration of segments

• The pause between words
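The splitting rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the decision tree used in the experiments: the thresholds `max_duration`, `min_pause` and `min_duration` are hypothetical values, and the choice to cut at the longest qualifying inter-word pause is an assumption made for the example.

```python
def split_long_segment(words, max_duration=15.0, min_pause=0.3, min_duration=2.0):
    """Recursively split one segment at its longest inter-word pause.

    words: time-ordered list of (text, start, duration) tuples, in seconds.
    All threshold values are hypothetical, chosen only for illustration.
    """
    if not words:
        return []
    seg_start = words[0][1]
    seg_end = words[-1][1] + words[-1][2]
    # Feature 1: segment duration. Short enough segments are left untouched.
    if seg_end - seg_start <= max_duration:
        return [words]

    best_i, best_pause = None, min_pause
    for i in range(1, len(words)):
        prev_end = words[i - 1][1] + words[i - 1][2]
        # Feature 2: pause between consecutive words.
        pause = words[i][1] - prev_end
        left = prev_end - seg_start
        right = seg_end - words[i][1]
        # Never create pieces shorter than min_duration.
        if pause >= best_pause and left >= min_duration and right >= min_duration:
            best_i, best_pause = i, pause

    if best_i is None:  # no acceptable split point
        return [words]
    return (split_long_segment(words[:best_i], max_duration, min_pause, min_duration)
            + split_long_segment(words[best_i:], max_duration, min_pause, min_duration))
```

The `min_duration` guard reflects the constraint above that splitting must not produce segments that are too short for the language model.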

13

Second strategy: concatenation of short segments

Why?

• Short segments are not optimal for the language model

• Short segments increase the WER because all their words are close to the boundaries (cf. BBN’s error analysis)

How?

• Take 3 features into account for decision tree:

• Pause between segments

• Sum of the duration of two neighbors

• Cluster label
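A possible reading of the concatenation rule, using the three features listed above, is sketched below. The thresholds `max_pause` and `max_total` are hypothetical, and requiring identical cluster labels is an assumption about how the cluster feature is used.

```python
def merge_short_segments(segments, max_pause=0.5, max_total=10.0):
    """Greedily concatenate neighboring segments.

    segments: time-ordered list of (start, end, cluster) tuples, in seconds.
    Thresholds are hypothetical values for illustration only.
    """
    merged = []
    for start, end, cluster in segments:
        if merged:
            p_start, p_end, p_cluster = merged[-1]
            pause = start - p_end                     # pause between segments
            total = (p_end - p_start) + (end - start)  # summed duration of the two neighbors
            if pause <= max_pause and total <= max_total and cluster == p_cluster:
                # Extend the previous segment instead of starting a new one.
                merged[-1] = (p_start, end, p_cluster)
                continue
        merged.append((start, end, cluster))
    return merged
```

Because each merged segment is compared with the next one in turn, a run of several short segments can collapse into one, until the summed duration exceeds `max_total`.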

14

Third strategy: Addition of time pads

Why?

• Prevent words from only being partially included

• Because the windowing in the front end has a scope of up to 8 frames (4 on each side), it is better to have enough padding

How?

• Take 1 feature into account for decision tree:

• The pause between segments
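One way to picture the padding step is the sketch below. The pad size `max_pad` is a hypothetical value, and limiting each pad to half of the pause to the neighboring segment (so that padded segments never overlap) is an assumption made for the example, not a rule stated in the slides.

```python
def add_time_pads(segments, max_pad=0.2):
    """Extend each segment on both sides, bounded by the pause to its neighbors.

    segments: time-ordered list of (start, end) tuples, in seconds.
    max_pad is a hypothetical value for illustration only.
    """
    padded = []
    for i, (start, end) in enumerate(segments):
        if i > 0:
            pause_before = max(start - segments[i - 1][1], 0.0)
            left = min(max_pad, pause_before / 2)
        else:
            left = min(max_pad, start)  # do not pad before the start of the show
        if i + 1 < len(segments):
            pause_after = max(segments[i + 1][0] - end, 0.0)
            right = min(max_pad, pause_after / 2)
        else:
            right = max_pad
        padded.append((start - left, end + right))
    return padded
```

With a long pause on either side, a segment receives the full pad; with a short pause, the two neighbors share the available silence equally.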

15

Examples of improvements (1)

1) Real sentence: … and strictly limits state authority over how and when water is used …

Recognized sentence:

With baseline segmentation (cuts in middle of sentence):

… and stricter limits data arty over how and when watery hues …

[Figure: recognized words along a time axis, with segmentation points marked and errors shown in red]

With MDE-based segmentation:

… and strict_ limits state authority over how and when water issues …

16

Examples of improvements (2)

2) Real sentence: … I didn’t know if we would pull off the games. I didn’t know if this community would ever rally around the Olympics again. …

Recognized sentence:

With baseline segmentation (doesn’t cut at end of sentence):

… pull off the games that had not this community would ever rally around …

With MDE-based segmentation:

… pull off the game_ I didn’t know _ this community would ever rally around …

17

Results for the development set

WER (%)               Baseline seg   Step 0: SU seg   SU seg + step 1   SU seg + steps 1 & 2   SU seg + steps 1 & 2 & 3
Wtd avg on 6 shows    14.0           14.4             14.2              14.0                   13.3

The improvement is 0.7% absolute and 5% relative!
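The absolute and relative figures follow directly from the baseline and final WER values in the table. The helper below is only a worked check of that arithmetic; the function name `wer_improvement` is made up for the example.

```python
def wer_improvement(baseline, new):
    """Return (absolute, relative) WER improvement, both in percent."""
    absolute = baseline - new
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Dev set: baseline segmentation vs. SU segmentation with all three steps.
abs_gain, rel_gain = wer_improvement(14.0, 13.3)
print(f"{abs_gain:.1f}% absolute, {rel_gain:.1f}% relative")
```

The same function applied to the step 0 value (14.4) and the final value (13.3) gives the improvement attributable to the three processing strategies alone.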

18

Results for the evaluation set

WER (%)               Baseline seg   Step 0: SU seg   SU seg + step 1   SU seg + steps 1 & 2   SU seg + steps 1 & 2 & 3
Wtd avg on 6 shows    18.7           19.8             19.7              19.6                   18.4

The improvement is 0.3% absolute and 1.6% relative!

19

Dev results vs. Eval results

Observations:

• No « cheating » information is available for the eval set, so it is unclear how well the SU detection is working

• Improvements from step 0 (SU segmentation) to the final segmentation are similar for the dev set and the eval set: 1.1% absolute (7.6% relative) for the dev set and 1.3% absolute (6.6% relative) for the eval set; note that the SU information was not optimized for the eval set

• Respective improvements are quite uneven across shows, which suggests that the strategies are show dependent, not channel dependent

20

Future work

• Further optimize the thresholds for the three strategies

• Find a representation to choose a specific value of the thresholds for each show individually (i.e. fully adapt the decision trees to each show)

• Use Metadata objects such as the confidence measure of each SU and diarization to further improve the strategies

21

Conclusion

• Development of a new segmentation method based on Metadata and Speech-To-Text information

• Use features given by MDE and STT information in decision trees for each processing step

• Results indicate the promise of this approach

• There still seems to be room for improvement through further development

22

Acknowledgments

I would like to thank:

• Prof. Bourlard & Prof. Morgan

• Barbara & Andreas

• Yang

• IM2 for supporting my experience
