
Speech and Language Technologies for Audio Indexing and Retrieval


Page 1: Speech and Language Technologies for Audio Indexing and Retrieval

Presented by Chin-Kai Wu, CS, NTHU, 2001/03/29

Speech and Language Technologies for Audio Indexing and Retrieval

JOHN MAKHOUL, FELLOW, IEEE, FRANCIS KUBALA, TIMOTHY LEEK, DABEN LIU, LONG NGUYEN, RICHARD SCHWARTZ, AND AMIT SRIVASTAVA, MEMBER, IEEE

PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, AUGUST 2000

Page 2: Speech and Language Technologies for Audio Indexing and Retrieval


Outline

Introduction
Indexing and Browsing with Rough’n’Ready
  Rough’n’Ready System
  Indexing and Browsing
Statistical Modeling Paradigm
Speech Recognition
Speaker Recognition
  Segmentation
  Clustering
  Identification

Page 3: Speech and Language Technologies for Audio Indexing and Retrieval


Introduction

Much of the information will be in the form of speech from various sources.

It’s now possible to start building automatic content-based indexing and retrieval tools.

The Rough’n’Ready system provides a rough transcription of the speech that is ready for browsing.

The technologies incorporated in the system include speech/speaker recognition, name spotting, topic classification, story segmentation and information retrieval.

Page 4: Speech and Language Technologies for Audio Indexing and Retrieval


Rough’n’Ready System

[System architecture diagram: the indexer (a dual P733-MHz machine) feeds a server that collects and manages the archive (audio compressed as MP3) and interacts with the browser, which is built from ActiveX controls.]

Page 5: Speech and Language Technologies for Audio Indexing and Retrieval


Indexing and Browsing

Page 6: Speech and Language Technologies for Audio Indexing and Retrieval


Indexing and Browsing (Cont’d)

[Browser screenshot annotated with the index fields: Speaker, People, Place, Organization, Topic Labels]

Page 7: Speech and Language Technologies for Audio Indexing and Retrieval


Indexing and Browsing (Cont’d)

Topic annotations are selected from a set of over 5500 topic labels

Page 8: Speech and Language Technologies for Audio Indexing and Retrieval


Statistical Modeling Paradigm

Recognition chooses the output (the desired recognized sequence of the data) that maximizes P(output | input, model)
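
For speech recognition, this generic criterion is usually expanded with Bayes' rule so that the two model types introduced on the next slide, the acoustic model and the language model, appear explicitly (standard notation, not copied from the slides):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```

where X is the observed sequence of feature vectors and W ranges over candidate word sequences.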

Page 9: Speech and Language Technologies for Audio Indexing and Retrieval


Speech Recognition

Statistical models: acoustic models and language models

Acoustic model: describes the time-varying evolution of feature vectors for each sound or phoneme

Hidden Markov models (HMMs) are employed, with a Gaussian mixture modeling the feature vector for each HMM state

Special acoustic models cover nonspeech events: music, silence/noise, laughter, breath, and lip-smack

Language model: N-gram language model
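
As a toy illustration of the language-model side, the sketch below builds a minimal add-one-smoothed bigram model; the class name and the smoothing choice are assumptions for this sketch, and the real system uses much larger n-gram models trained on broadcast-news text.

```python
import math
from collections import Counter

class BigramLM:
    """Minimal add-one-smoothed bigram language model (a toy stand-in for the
    large n-gram models used by the recognizer)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + words + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, words):
        """Log P(word sequence) as a sum of smoothed bigram log probabilities."""
        padded = ["<s>"] + words + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total

# Tiny usage example on made-up text
lm = BigramLM([["tonight", "on", "the", "news"], ["on", "the", "news", "tonight"]])
print(lm.log_prob(["on", "the", "news"]))
```

The acoustic model plays the analogous role for P(X | W), with a Gaussian mixture per HMM state scoring the observed feature vectors.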

Page 10: Speech and Language Technologies for Audio Indexing and Retrieval


Speech Recognition (Cont’d)

Multipass recognition search strategy

Fast-match pass: narrows the search space; subsequent passes with more accurate models operate on the smaller search space

Backward pass: generates the top-scoring N-best word sequences (100 <= N <= 300)

N-best rescoring pass: Tree Rescoring algorithm
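
The Tree Rescoring algorithm itself is not described on this slide; the sketch below shows only the generic N-best rescoring idea it implements, re-ranking the hypotheses produced by the backward pass with a weighted combination of more accurate acoustic and language-model scores (the weight values, the word-insertion penalty, and the function names are illustrative assumptions).

```python
def rescore_nbest(hypotheses, acoustic_score, lm_score,
                  lm_weight=15.0, word_penalty=-0.5):
    """Re-rank N-best hypotheses (100 <= N <= 300 above) with more accurate
    models than the ones used to generate them.

    hypotheses: list of word sequences (each a list of strings)
    acoustic_score, lm_score: callables returning a log score for a hypothesis
    """
    def total(words):
        return (acoustic_score(words)
                + lm_weight * lm_score(words)
                + word_penalty * len(words))
    return max(hypotheses, key=total)
```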

Page 11: Speech and Language Technologies for Audio Indexing and Retrieval


Speech Recognition (Cont’d)

Speedup algorithms: Fast Gaussian Computation (FGC), Grammar Spreading, N-Best Tree Rescoring

Word error rate (PII 450-MHz processor, 60000-word vocabulary; x RT = processing time as a multiple of real time):
3 x RT => 21.4%
10 x RT => 17.5%
230 x RT => 14.8%

Page 12: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Recognition

Speaker segmentation: segregates the audio stream by speaker

Speaker clustering: groups together audio segments that are from the same speaker

Speaker identification: recognizes those speakers of interest whose voices are known to the system

Page 13: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Segmentation

Two-stage approach to speaker change detection:
First stage: detects speech/nonspeech boundaries
Second stage: performs the actual speaker segmentation within the speech segments

First stage:
Collapses the phonemes into three broad classes (vowels, fricatives, and obstruents)
Includes the five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
Models are 5-state HMMs
Boundaries are detected reliably over 90% of the time
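
For a concrete picture of the first stage's label inventory, the sketch below collapses a frame-level label sequence into the three broad speech classes plus the five nonspeech classes and merges runs of identical labels into regions; the phoneme-to-class fragment is a made-up illustration, not BBN's mapping.

```python
from itertools import groupby

# Illustrative fragment of a phoneme -> broad-class mapping (assumed, not BBN's)
BROAD_CLASS = {"aa": "vowel", "iy": "vowel", "s": "fricative", "f": "fricative",
               "t": "obstruent", "k": "obstruent"}
NONSPEECH = {"music", "silence/noise", "laughter", "breath", "lip-smack"}

def collapse_labels(frame_labels):
    """Map labels to the 3 broad speech classes + 5 nonspeech classes, then
    merge consecutive identical labels into (label, run length) regions."""
    collapsed = [lbl if lbl in NONSPEECH else BROAD_CLASS.get(lbl, "obstruent")
                 for lbl in frame_labels]
    return [(lbl, sum(1 for _ in run)) for lbl, run in groupby(collapsed)]

print(collapse_labels(["silence/noise", "t", "aa", "aa", "s", "music", "music"]))
# [('silence/noise', 1), ('obstruent', 1), ('vowel', 2), ('fricative', 1), ('music', 2)]
```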

Page 14: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Segmentation (Cont’d)

Second stage:
Hypothesizes a speaker change boundary at every phone boundary located in the first stage
The speaker change decision takes the form of a likelihood ratio (λ) test

Decision rule (threshold t, offset α):
Nonspeech region: same speaker if λ <= t, otherwise a speaker change
Speech region: same speaker if λ <= t + α, otherwise a speaker change
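
The slides do not give the exact form of λ; the sketch below shows one common way such a test can be built, comparing a single shared Gaussian over both sides of a hypothesized boundary against separate Gaussians for each side, and then applying the thresholding scheme above (the Gaussian model choice and all function names are assumptions, not BBN's implementation).

```python
import numpy as np

def gaussian_loglik(x, mean, cov):
    """Per-frame log-likelihood of frames x (n x d) under a full-covariance Gaussian."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return -0.5 * (quad + logdet + d * np.log(2 * np.pi))

def change_score(left, right):
    """Evidence for a speaker change at the boundary between two frame blocks.

    Compares modeling both sides with one shared Gaussian (same speaker)
    against modeling each side with its own Gaussian (different speakers).
    Larger values mean stronger evidence for a change."""
    both = np.vstack([left, right])
    ll_joint = gaussian_loglik(both, both.mean(0), np.cov(both.T)).sum()
    ll_split = (gaussian_loglik(left, left.mean(0), np.cov(left.T)).sum()
                + gaussian_loglik(right, right.mean(0), np.cov(right.T)).sum())
    return ll_split - ll_joint

def same_speaker(left, right, t, alpha, in_speech):
    """Thresholding scheme from the slide: t for boundaries in nonspeech
    regions, t + alpha for boundaries inside speech regions."""
    threshold = t + alpha if in_speech else t
    return change_score(left, right) <= threshold
```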

Page 15: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Clustering

The likelihood ratio test is applied repeatedly to merge the cluster pairs that are deemed most similar, until all segments are grouped into one cluster and a complete cluster tree is generated

To choose the final clustering, the algorithm finds the cut of the tree that is optimal with respect to a criterion involving:
K: the number of clusters for a particular cut of the tree
Nj: the number of feature vectors in cluster j
the log of the determinant of the within-cluster dispersion matrix
a compensation term for the previous term
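
The legend above lists the ingredients of the criterion without its exact form; the sketch below substitutes a BIC-style penalized within-cluster dispersion cost built from those same ingredients, so the N_j weighting and the form of the penalty are assumptions rather than the paper's formula.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cut_cost(features, labels):
    """Cost of one cut of the cluster tree: sum over clusters of
    N_j * log|W_j| (within-cluster dispersion) plus a BIC-style penalty
    that grows with the number of clusters K (a stand-in for the
    'compensation' term on the slide)."""
    n, d = features.shape
    cost = 0.0
    for j in np.unique(labels):
        cluster = features[labels == j]
        if len(cluster) <= d:          # too few frames for a full covariance
            continue
        _, logdet = np.linalg.slogdet(np.cov(cluster.T))
        cost += len(cluster) * logdet
    k = len(np.unique(labels))
    penalty = 0.5 * k * (d * (d + 3) / 2) * np.log(n)
    return cost + penalty

def cluster_speakers(features, max_k=20):
    """Build a complete cluster tree, then pick the cut with the lowest cost."""
    tree = linkage(features, method="ward")
    best = None
    for k in range(1, max_k + 1):
        labels = fcluster(tree, k, criterion="maxclust")
        score = cut_cost(features, labels)
        if best is None or score < best[0]:
            best = (score, labels)
    return best[1]
```

Lower cost is better: the dispersion term rewards tight clusters, while the penalty keeps the cut from splitting every segment into its own cluster.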

Page 16: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Clustering (Cont’d)

The algorithm performs well regardless of the true number of speakers, producing clusters of high purity

Purity is defined as the percentage of frames that are correctly clustered; the measured purity was 95.8%
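
A small sketch of how frame-level purity can be computed, assuming the common convention that a frame counts as correctly clustered when its cluster's dominant reference speaker matches it (that convention and the function name are assumptions; the slide only states the definition above).

```python
from collections import Counter

def cluster_purity(frame_clusters, frame_speakers):
    """Percentage of frames whose cluster's dominant reference speaker matches
    the frame's true speaker.

    frame_clusters: cluster id assigned to each frame
    frame_speakers: reference speaker id for each frame"""
    by_cluster = {}
    for c, s in zip(frame_clusters, frame_speakers):
        by_cluster.setdefault(c, Counter())[s] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return 100.0 * correct / len(frame_clusters)

# One stray frame in cluster 1 drags purity below 100% (5 of 6 frames correct)
print(cluster_purity([1, 1, 1, 2, 2, 2], ["a", "a", "b", "b", "b", "b"]))
```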

Page 17: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Identification

Every speaker cluster created in the speaker clustering stage is identified by gender

The gender of a speaker segment is then determined by computing the log likelihood ratio between the male and female models

This approach has resulted in a 2.3% error in gender detection
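
A minimal sketch of gender labeling by log likelihood ratio between two Gaussian mixture models, as the slide describes; the feature representation, the mixture size, and the use of scikit-learn's GaussianMixture are assumptions made for the illustration.

```python
from sklearn.mixture import GaussianMixture

def train_gender_models(male_frames, female_frames, n_components=16):
    """Fit one GMM per gender on labeled feature frames (n x d arrays)."""
    male = GaussianMixture(n_components, covariance_type="diag").fit(male_frames)
    female = GaussianMixture(n_components, covariance_type="diag").fit(female_frames)
    return male, female

def classify_gender(segment_frames, male, female):
    """Label a segment by the sign of the male-vs-female log likelihood ratio."""
    llr = (male.score_samples(segment_frames).sum()
           - female.score_samples(segment_frames).sum())
    return ("male" if llr > 0 else "female"), llr
```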

Page 18: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Identification (Cont’d)

In the DARPA Broadcast News corpus, 20% of the speaker segments are from 20 known speakers

This is an open-set problem: the data contains both known and unknown speakers, and the system must determine the identity of the known-speaker segments while rejecting the unknown-speaker segments

Page 19: Speech and Language Technologies for Audio Indexing and Retrieval


Speaker Identification (Cont’d)

The system produced three types of errors:

False identification rate of 0.1%: a known-speaker segment was mistaken as coming from another known speaker

False rejection rate of 3.0%: a known-speaker segment was classified as unknown

False acceptance rate of 0.8%: an unknown-speaker segment was classified as coming from one of the known speakers
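
A minimal sketch of the open-set decision described above: score a segment against every known-speaker model, take the best, and reject the segment as unknown when even the best score falls below a threshold (the per-speaker models, the threshold, and the function name are assumptions, not the actual system).

```python
def identify_speaker(segment_frames, known_models, reject_threshold):
    """Open-set identification: return the best-matching known speaker, or
    'unknown' when no known-speaker model scores well enough.

    known_models: dict mapping speaker name -> model with a score(frames)
                  method returning average log-likelihood per frame
                  (e.g. a fitted GMM)
    reject_threshold: minimum average log-likelihood to accept a match"""
    scores = {name: model.score(segment_frames)
              for name, model in known_models.items()}
    best_name = max(scores, key=scores.get)
    if scores[best_name] < reject_threshold:
        return "unknown", scores[best_name]
    return best_name, scores[best_name]
```

Moving the rejection threshold trades the error types against each other: a stricter threshold lowers the false acceptance rate for unknown speakers but raises the false rejection rate for known ones.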