A Phonotactic-Semantic Paradigm for Automatic Spoken Document Classification Bin MA and Haizhou LI Institute for Infocomm Research Singapore


ACM SIGIR, August 15-19, 2005, Bin MA

Agenda

• Spoken Document Classification & Related Works
• Phonotactic-semantic Approach
• Voice Tokenization with Acoustic Words
• Bag-of-Sounds Representation
• Language Identification Classifiers with SVM and LSA
• Conclusion


Spoken Document Classification & Related Works

• Spoken Document Retrieval (SDR) is the task of retrieving excerpts from a large collection of spoken documents based on a user’s request.
  – Automatic spoken document classification (SDC) is an important topic in SDR;
  – Conventionally approached by integrating automatic speech recognition (ASR) technologies and text information retrieval (IR).
• Most SDC efforts so far have been devoted to two paradigms:
  – lexical-semantic
  – n-gram phonotactic


• lexical-semantic
  – Convert the spoken documents into text transcripts of lexical words;
  – The transcripts are typically generated by a large vocabulary continuous speech recognizer (LVCSR).
  – Text categorization (TC) techniques are then applied to the automatic transcripts to derive semantic classes.
• Its major limitation is its lexical choice. Outstanding problems:
  – Homophone
  – Out-of-Vocabulary (OOV)
  – Multilinguality



• n-gram phonotactic
  – Use n-gram phonotactics, i.e. the rules governing the sequences of allowable phonemes, instead of lexical words to represent the lexical constraints that are imposed by semantic domains;
  – Enhance robustness against speech recognition errors.
• Its major shortcoming is that it does not exploit the global phonotactics in the larger context of a spoken document. Outstanding problems:
  – Semantic Abstraction
  – Multilinguality



Phonotactic-semantic Approach

• Spoken document classification (SDC) is more complex than text categorization (TC).
  – In TC, we usually derive the lexical vocabulary from the running text.
  – For spoken documents, an additional tokenization step is needed to convert the sound wave into a sequence of phonetic units, such as words or phonemes.
• Two issues:
  – the definition of tokenization unit, and
  – the choice of vocabulary.


• Definition of tokenization unit
  – Traditionally, the lexical words or phonemes of a specific language are used.
  – We propose to use a set of universal acoustic words (AWs): language-independent, self-organized, phoneme-like units.
  – We treat the documents in all languages equally with the same set of AWs.
  – AWs can be learned from a multilingual training corpus using a data-driven approach.



• Choice of vocabulary
  – Use the bag-of-sounds statistics over AWs, instead of bag-of-words over lexical words, to derive high-level semantic characteristics from a spoken document.
  – The bag-of-sounds concept is analogous to the bag-of-words paradigm originally formulated in the context of information retrieval (IR) and text categorization (TC).
  – A spoken document is then represented by a high-dimensional vector derived from the statistics of term frequency.



| Approach | Lexical constraint | Latent semantics | Outstanding problems |
|---|---|---|---|
| Lexical-semantic approach | Lexical word | bag-of-words vector | 1. Homophone 2. OOV 3. Multilinguality |
| n-gram phonotactic approach | n-local phonotactics | - | 1. Multilinguality 2. Semantic Abstraction |
| Phonotactic-semantic approach | n-local phonotactics | bag-of-sounds vector | - |



• Three fundamental components for SDC
  – A voice tokenizer, i.e. a speech recognizer front-end which segments a spoken document into acoustic tokens;
  – A statistical language model which captures statistics of semantic domain information;
  – A classifier which categorizes a spoken document using the statistical language model.






Voice Tokenization with Acoustic Words

[Figure labels: word, phoneme, frame]


Voice Tokenization – Acoustic segment modeling (ASM)

• Segment an utterance into Q consecutive segments in a maximum likelihood manner
  – minimizing an overall distortion with dynamic programming;
• Cluster all segments into T classes with the k-means algorithm
  – speech segments in the same class are acoustically similar;
• Train one HMM for each class
  – establish T acoustic segment models to represent the overall acoustic space of all languages.
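The segmentation step of ASM can be illustrated with a small dynamic program. This is only a sketch under simplifying assumptions that are not in the slides: each frame is a single number rather than a feature vector, and the distortion of a segment is the squared distance of its frames to the segment mean. The function name `segment_min_distortion` is hypothetical.

```python
def segment_min_distortion(frames, num_segments):
    """Split `frames` into `num_segments` contiguous segments that minimize
    the total squared distance of each frame to its segment mean.

    Returns (boundaries, distortion); segment q covers
    frames[boundaries[q]:boundaries[q + 1]].
    """
    n = len(frames)
    if not 1 <= num_segments <= n:
        raise ValueError("need 1 <= num_segments <= len(frames)")

    # Prefix sums make the distortion of any span an O(1) computation.
    prefix = [0.0]
    prefix_sq = [0.0]
    for x in frames:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def span_cost(i, j):
        """Squared error of frames[i:j] around their mean."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # best[q][j]: minimal distortion of the first j frames using q segments.
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for q in range(1, num_segments + 1):
        for j in range(q, n + 1):
            for i in range(q - 1, j):
                candidate = best[q - 1][i] + span_cost(i, j)
                if candidate < best[q][j]:
                    best[q][j] = candidate
                    back[q][j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    boundaries = [n]
    j = n
    for q in range(num_segments, 0, -1):
        j = back[q][j]
        boundaries.append(j)
    boundaries.reverse()
    return boundaries, best[num_segments][n]
```

In a full system the resulting segments would then be clustered into T classes and one HMM trained per class, as the slide describes; those steps are omitted here.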


Voice Tokenization – Phonetically-bootstrapped ASM

• Add phonetic constraints in segmentation
  – use a large amount of labeled speech data from a few well-studied languages;
  – train language-specific phone models;
  – choose some models to form a set of T models for bootstrapping;
• Phonetically label the multilingual training utterances
  – use the T models to decode all training utterances;
  – keep the recognized sequences as “true” labels;
• Re-train models
  – force-align and segment all utterances based on the “true” labels;
  – group all speech segments of a specific label into a class;
  – use these segments to re-train an HMM.





Bag-of-Sounds Representation

• Bag-of-sounds is analogous to bag-of-words;
• $W = T^n$ AWs in the vocabulary, built from T acoustic tokens;
• A spoken document is described as a count vector of AWs: each element is the count of one AW, and the dimension is the AW vocabulary size W.
• Capture local phonotactics with lexical constraints;
• Capture global phonotactics with co-occurrences of AWs.
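A bag-of-sounds count vector can be sketched as follows. The sketch assumes that the voice tokenizer has already produced a sequence of integer token ids in the range 0 to T-1, and that an AW is an n-gram of such tokens, so that the vector has T^n entries. The function name `bag_of_sounds` and the index layout are illustrative choices, not taken from the slides.

```python
def bag_of_sounds(tokens, num_tokens, n=2):
    """Count n-gram acoustic words in a token sequence.

    tokens     : list of integer token ids, each in range(num_tokens)
    num_tokens : T, the size of the acoustic token inventory
    n          : n-gram order; the vector has num_tokens ** n entries
    """
    for t in tokens:
        if not 0 <= t < num_tokens:
            raise ValueError(f"token id {t} outside 0..{num_tokens - 1}")

    counts = [0] * (num_tokens ** n)
    for start in range(len(tokens) - n + 1):
        # Read the n-gram as a base-T number to get its vector index.
        index = 0
        for t in tokens[start:start + n]:
            index = index * num_tokens + t
        counts[index] += 1
    return counts
```

With T = 128 and n = 2 this yields a 16,384-dimensional vector; most entries are zero for a short document, so a sparse representation would be a natural refinement.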





Language Identification

• National Institute of Standards and Technology (NIST) 1996 Language Recognition Evaluation (LRE) database.
• 12 languages: Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese.
• Linguistic Data Consortium (LDC) Callfriend corpus as the training data.
  – 40 30-minute conversations;
  – 12,000 30-second training sessions for each language.
• 1492 30-second speech sessions from the 1996 NIST LRE database as the test data.


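Language Identification

[Figure: system diagram. A spoken utterance passes through the Universal VT; the language models LM-1: English, LM-2: Chinese, ..., LM-L: French feed the Language Classifier, which outputs the hypothesized language.]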


SVM Classifier with Feature Extraction

• SVMlight V6.01 from http://svmlight.joachims.org/
• Work with a linear kernel SVM;
• Feature dimension: $T = 128$; $W = T^2 = 16{,}384$ AWs
• L*(L-1)/2 pair-wise binary SVMs
• The class that gains most of the winning votes takes all.
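The pair-wise voting scheme can be sketched independently of the SVM itself. In this minimal illustration the binary decision is a caller-supplied function `decide(a, b)`, a hypothetical stand-in for a trained pair-wise SVM, and ties are broken in favor of the class that received its first vote earliest; the slides do not say how ties were handled.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(classes, decide):
    """Run one binary decision per unordered class pair and return
    (winning class, number of pair-wise decisions made).

    decide(a, b) must return either a or b.
    """
    pairs = list(combinations(classes, 2))  # L * (L - 1) / 2 pairs
    votes = Counter()
    for a, b in pairs:
        votes[decide(a, b)] += 1
    winner, _ = votes.most_common(1)[0]
    return winner, len(pairs)
```

For the 12 languages in the evaluation this amounts to 12 * 11 / 2 = 66 binary classifiers.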


• Count-trimming (CT): remove
  – AWs that have very low frequency;
  – AWs that occur in too few documents.

• Mutual Information (MI)
  – Class membership: $X = \{x, \bar{x}\}$
  – Particular AW’s presence: $Y = \{y, \bar{y}\}$
  – MI indicates the contribution to semantic classification from an AW’s presence.

$$MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
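The mutual information between class membership and an AW's presence can be estimated from a 2x2 table of document counts. The sketch below assumes maximum-likelihood probability estimates and the natural logarithm, and treats empty cells as contributing zero; the function and argument names are illustrative.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI between class membership X and AW presence Y.

    n11: documents in the class that contain the AW
    n10: documents in the class that do not contain the AW
    n01: documents outside the class that contain the AW
    n00: documents outside the class that do not contain the AW
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        raise ValueError("at least one document is required")

    joint = {
        (1, 1): n11 / total,
        (1, 0): n10 / total,
        (0, 1): n01 / total,
        (0, 0): n00 / total,
    }
    p_x = {1: (n11 + n10) / total, 0: (n01 + n00) / total}
    p_y = {1: (n11 + n01) / total, 0: (n10 + n00) / total}

    mi = 0.0
    for (x, y), p_xy in joint.items():
        if p_xy > 0.0:  # a zero cell contributes nothing to the sum
            mi += p_xy * math.log(p_xy / (p_x[x] * p_y[y]))
    return mi
```

An AW whose presence is independent of the class scores 0, while an AW that appears in exactly the documents of a class covering half the corpus scores log 2.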



• Separation Margin (SM)
  – SVM with a linear kernel
  – $f(c) = a^T c + b$, while $a = \{a_1, a_2, \dots, a_W\}$
  – Margin is inversely proportional to $\|a\|$
  – Features with higher $|a_j|$ are more influential in determining the width of the separation margin.
• Feature Weighting
  – $c_d = \{c_{1,d}, c_{2,d}, \dots, c_{W,d}\}^T$
  – $c_{w,d} = c_{w,d}\, \mathrm{idf}(w) \big/ \big(\sum_{w=1}^{W} c_{w,d}^{2}\big)^{1/2}$
  – $\mathrm{idf}(w) = \log\big(D / d(w)\big)$
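The feature weighting can be sketched as a small function over count vectors. It rests on one reading of the weighting formula that should be treated as an assumption: each raw count is multiplied by idf(w) and divided by the Euclidean norm of the document's raw count vector, with D the number of documents and d(w) the number of documents that contain AW w. AWs that occur in no document get weight 0 here, which is also an assumption.

```python
import math


def weight_counts(count_vectors):
    """Apply idf weighting and length normalization to raw AW counts.

    count_vectors: list of equal-length lists, one count vector per document.
    Returns a new list of weighted vectors.
    """
    num_docs = len(count_vectors)
    vocab_size = len(count_vectors[0])

    # d(w): number of documents in which AW w occurs at least once.
    doc_freq = [
        sum(1 for doc in count_vectors if doc[w] > 0) for w in range(vocab_size)
    ]
    idf = [math.log(num_docs / df) if df > 0 else 0.0 for df in doc_freq]

    weighted = []
    for doc in count_vectors:
        norm = math.sqrt(sum(c * c for c in doc))
        if norm == 0.0:
            weighted.append([0.0] * vocab_size)
        else:
            weighted.append([c * idf[w] / norm for w, c in enumerate(doc)])
    return weighted
```

Note that an AW present in every document receives idf 0 and therefore drops out of the representation entirely.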



[Figure: SLID error rate comparison among three feature selection techniques (SM, MI, CT). x-axis: Acoustic Word Vocabulary Size; y-axis: Error Rate (%).]



[Figure: Effect of training corpus size. x-axis: Number of Training Sessions per Language (100 to 10,000); y-axis: Error Rate (%).]


LSA Classifier with SVD

• Singular Value Decomposition (SVD)
  – Term-document matrix: $H$ of size $W \times D$
  – SVD: $H = U S V^T$, with $U: W \times R$; $V: D \times R$; $S: R \times R$ $(s_1 \ge s_2 \ge \dots \ge s_R)$
  – Retain the top Q singular values in matrix S
• Latent Semantic Analysis (LSA)

$$g(c_i, c_j) = \cos(v_i S, v_j S) = \frac{v_i S^2 v_j^T}{\|v_i S\|\, \|v_j S\|}$$

$$k(c_i, c_j) = k(v_i, v_j) = \cos^{-1} g(c_i, c_j)$$

$$c_p \rightarrow v_p = c_p^T U S^{-1}$$
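The similarity g and the distance k between two documents in the LSA space can be computed directly from their projected vectors and the retained singular values. The sketch below assumes that S is diagonal, so it is passed as a plain list of singular values, and that the SVD and projection have already been carried out elsewhere. The function names are illustrative.

```python
import math


def lsa_similarity(v_i, v_j, singular_values):
    """g = cos(v_i S, v_j S) for a diagonal S given as a list of values."""
    scaled_i = [v * s for v, s in zip(v_i, singular_values)]
    scaled_j = [v * s for v, s in zip(v_j, singular_values)]
    dot = sum(a * b for a, b in zip(scaled_i, scaled_j))
    norm_i = math.sqrt(sum(a * a for a in scaled_i))
    norm_j = math.sqrt(sum(b * b for b in scaled_j))
    if norm_i == 0.0 or norm_j == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, dot / (norm_i * norm_j)))


def lsa_distance(v_i, v_j, singular_values):
    """k = arccos(g): 0 for identical directions, pi for opposite ones."""
    return math.acos(lsa_similarity(v_i, v_j, singular_values))
```

Scaling by the singular values gives more weight to the dimensions that explain more of the term-document matrix, which is why g is computed on v S rather than on v alone.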


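• LSA Classifier I – k-nearest neighbor

$$\hat{l} = \arg\min_{l} k(v_p, v_l)$$

• LSA Classifier II – mixture modeling

$$k(v_i, v_j) = p(v_i \mid v_j)$$

$$p(v_i \mid \lambda_l) = \sum_{m=1}^{M} p(v_{l,m})\, p(v_i \mid v_{l,m})$$

Likelihood of the training vectors over all L classes, where $\lambda_l$ denotes the model of class $l$:

$$\prod_{l=1}^{L} \prod_{d=1}^{|D_l|} p(v_d \mid \lambda_l)$$

$$\hat{l} = \arg\max_{l} p(v_p \mid \lambda_l) = \arg\max_{l} \sum_{m=1}^{M} p(v_{l,m})\, p(v_p \mid v_{l,m})$$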
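The decision rule of the mixture-model classifier can be sketched as a weighted sum of component likelihoods followed by an arg max over classes. The slides do not specify the form of the component likelihood, so it is passed in as a function; `toy_likelihood` below is a hypothetical unnormalized stand-in used only to make the example runnable, and the data layout of `models` is likewise an assumption.

```python
import math


def classify_by_mixture(v_p, models, component_likelihood):
    """Return the class whose mixture gives v_p the highest score.

    models: dict mapping a class label to a list of
            (mixture weight, component vector) pairs.
    component_likelihood(v, component) plays the role of p(v | v_{l,m}).
    """
    best_label = None
    best_score = -math.inf
    for label, components in models.items():
        score = sum(
            weight * component_likelihood(v_p, component)
            for weight, component in components
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def toy_likelihood(v, component):
    """Hypothetical stand-in: decays with squared Euclidean distance."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(v, component)))
```

A short usage example with made-up two-dimensional vectors:

```python
models = {
    "en": [(0.5, [0.0, 0.0]), (0.5, [1.0, 0.0])],
    "fr": [(1.0, [5.0, 5.0])],
}
label = classify_by_mixture([0.2, 0.1], models, toy_likelihood)
```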



[Figure: Effect of Mixture Number M (LSAC-II). x-axis: Number of Mixtures (1 to 1024); y-axis: Error Rate (%).]




| #M | 1,000 | 2,000 | 6,000 | 12,000 |
|---|---|---|---|---|
| LSAC-I Error (%) | 19.8 | 16.5 | 15.2 | 14.8 |
| SVMC Error (%) | 18.2 | 16.2 | 14.4 | 13.9 |

Effect of training data size in LSAC-I & SVMC

| Model | P-PRLM | P-PRLM & Score Fusion | LSAC_II | SVMC |
|---|---|---|---|---|
| Error (%) | 22.0 | 17.0 | 14.9 | 13.9 |

Benchmark of different models


Conclusion

• Non-lexical approach to spoken document tokenization
  – Universal acoustic word (AW): language-independent, self-organized, and phoneme-like units;
  – Data-driven approach to learn from a multilingual training corpus.
• Phonotactic-semantic paradigm to model
  – Local phonotactics in an acoustic word (AW);
  – Global phonotactics in a bag-of-sounds vector.


• Thank you!
