A Comparative Study on the Effect of Different Codecs on ... · A Comparative Study on the E ect of...

Preview:

Citation preview

A Comparative Study on the Effect of Different Codecson Speech Recognition Accuracy Using Various Acoustic

Modeling Techniques

Srinivasa Raghavan1, Nisha Meenakshi G1, Sanjeev Kumar Mittal1,Chiranjeevi Yarra1, Anupam Mandal2, K.R. Prasanna Kumar2,

Prasanta Kumar Ghosh1

1SPIRE LAB, Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India,2Center for AI and Robotics, Bangalore, Karnataka, India

x February 2017

SPIRE LAB, IISc, Bangalore 1

Introduction

Section 1

1 Introduction

2 Previous Works

3 Experiments

4 Results

5 Conclusion

SPIRE LAB, IISc, Bangalore 2

Introduction

Focus

A Comparative Study on the Effect of Different Codecs on SpeechRecognition Accuracy Using Various Acoustic Modeling Techniques.

SPIRE LAB, IISc, Bangalore 3

Introduction

Speech Coding & Automatic Speech Recognition (ASR)

SPIRE LAB, IISc, Bangalore 4

Note

1 The Channel Effect is not considered.

2 Effect of Language Model is not considered.

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZEDPHONEME

ASR with Codec Distorted Input

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Recognition

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Coding

Introduction

Speech Coding & Automatic Speech Recognition (ASR)

SPIRE LAB, IISc, Bangalore 4

Note

1 The Channel Effect is not considered.

2 Effect of Language Model is not considered.

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZEDPHONEME

ASR with Codec Distorted Input

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Recognition

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Coding

Introduction

Speech Coding & Automatic Speech Recognition (ASR)

SPIRE LAB, IISc, Bangalore 4

Note

1 The Channel Effect is not considered.

2 Effect of Language Model is not considered.

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZEDPHONEME

ASR with Codec Distorted Input

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Recognition

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Coding

Introduction

Speech Coding & Automatic Speech Recognition (ASR)

SPIRE LAB, IISc, Bangalore 4

Note

1 The Channel Effect is not considered.

2 Effect of Language Model is not considered.

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZEDPHONEME

ASR with Codec Distorted Input

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Recognition

ENCODER DECODERCHANNEL

CODEC TYPES:WAVEFORM

PARAMETRICHYBRID

FEATUREEXTRACTION

ACOUSTIC MODELING TECHNIQUES (AMT)

GMM-HMMSGMM

DNN

ACOUSTIC MODEL

+LANGUAGE

MODEL

RECOGNIZED PHONEME

Speech Coding

Introduction

Common Speech Coders

SPIRE LAB, IISc, Bangalore 5

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

VoIPG.711AG.729AG.729BSPEEXADPCM

MobileG.728MELP

GSM-8k

WirelessAMR-WBAMR-NB

Speech Coding

Introduction

Common Speech Coders

SPIRE LAB, IISc, Bangalore 5

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

VoIPG.711AG.729AG.729BSPEEXADPCM

MobileG.728MELP

GSM-8k

WirelessAMR-WBAMR-NB

Speech Coding

Introduction

Common Speech Coding Strategies

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

Single Encoding-Decoding

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

Tandem Encoding-Decoding

SPIRE LAB, IISc, Bangalore 6

Introduction

Common Speech Coding Strategies

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

Single Encoding-Decoding

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

Tandem Encoding-Decoding

SPIRE LAB, IISc, Bangalore 6

Introduction

Problem statement

FEATUREEXTRACTION

ACOUSTIC MODEL

RECOGNIZEDPHONEME

DATA AMT

CLEAN/ DISTORTED

DUE TO CODECS

Problem statement

What is that specific codec trained acoustic model, that performs wellfor different types of input speech (coded or clean PCM) across differentAMTs? Robust to codec induced distortions.

SPIRE LAB, IISc, Bangalore 7

Introduction

Problem statement

FEATUREEXTRACTION

ACOUSTIC MODEL

RECOGNIZEDPHONEME

DATA AMT

CLEAN/ DISTORTED

DUE TO CODECS

Problem statement

What is that specific codec trained acoustic model, that performs wellfor different types of input speech (coded or clean PCM) across differentAMTs? Robust to codec induced distortions.

SPIRE LAB, IISc, Bangalore 7

Introduction

Problem statement

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

SPIRE LAB, IISc, Bangalore 8

Introduction

Problem statement

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

SPIRE LAB, IISc, Bangalore 8

Introduction

Key Finding 1

SPIRE LAB, IISc, Bangalore 9

G.711A

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

Introduction

Key Finding 1

SPIRE LAB, IISc, Bangalore 9

G.711A

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

Introduction

Key Finding 2

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

C1

CN

C3

C2

CN

COCKTAIL ACOUSTIC MODEL

Cocktail Acoustic Model

SPIRE LAB, IISc, Bangalore 10

Introduction

Key Finding 2

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

C1

CN

C3

C2

CN

COCKTAIL ACOUSTIC MODEL

Cocktail Acoustic Model

SPIRE LAB, IISc, Bangalore 10

Previous Works

Section 2

1 Introduction

2 Previous Works

3 Experiments

4 Results

5 Conclusion

SPIRE LAB, IISc, Bangalore 11

Previous Works

Existing Literature

Single Encoding-Decoding

1 Lower recognition for low bit-rate codecs [Euler et al. (1994), Lilly et al. (1996)].

2 Study of speech recognition with GSM codecs [Kim et al. (2000), H.-G. Hirsch (2002)].

3 ASR under noisy conditions using G.729, G.723.1 and GSM codecs [Grande et al. (2001)]

Tandem Encoding-Decoding

1 Impact on ASR performance more for low bit-rate codecs [Lilly et al. (1996)].

2 Study of ASR performance under unkown Tandem scenario [Salonidis et al. (1998)].

Compensation Strategies

1 Enhancement of the decoded speech, robust feature extraction [Dufour et al. (1996)]

2 Adaptation of acoustic models [Mokbel et al.. (1997), Salonidis et al. (1998),Srinivasamurthy et al. (2001)]

SPIRE LAB, IISc, Bangalore 12

Experiments

Section 3

1 Introduction

2 Previous Works

3 Experiments

4 Results

5 Conclusion

SPIRE LAB, IISc, Bangalore 13

Experiments

AMTs and Codecs

SPIRE LAB, IISc, Bangalore 14

List of codecs

Codec Type Band-width

Bit-Rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

Details1 Kaldi toolkit [Povey et al. (2011)].

2 ASR metric: Phoneme Error Rate (PER)

3 Codecs source: IT-UT standards, SoX,SPEEX.

4 0-gram language model.

Acoustic Modeling Techniques (AMT)

1 Monophone based GMM-HMM (MONO)

2 Context-dependent triphone basedGMM-HMM (CD-TRI)

3 The Subspace Gaussian models withboosted Maximum Mutual Information(SGMM)

4 DNN with DBN Pretraining (DNN-DP)

5 DNN with state-level MBR(DNN-DP-sMBR)

Experiments

Datasets

SPIRE LAB, IISc, Bangalore 15

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

6 Tandem test databases: 1) ADPCM→GSM-8k→SPEEX, 2) ADPCM→SPEEX→GSM-8k,3) GSM-8k→ADPCM→SPEEX, 4) GSM-8k→SPEEX→ADPCM, 5)SPEEX→ADPCM→GSM-8k, 6) SPEEX→GSM-8k→ADPCM.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

8 acoustic models using singleencoding-decoding.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

TIMIT database. Sampling rate:8kHz.

Training set: 462 speakers with3696 utterances.

Development Set: 50 speakers with400 utterances.

Test Set: 24 speakers with 192utterances.

Experiments

Datasets

SPIRE LAB, IISc, Bangalore 15

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

6 Tandem test databases: 1) ADPCM→GSM-8k→SPEEX, 2) ADPCM→SPEEX→GSM-8k,3) GSM-8k→ADPCM→SPEEX, 4) GSM-8k→SPEEX→ADPCM, 5)SPEEX→ADPCM→GSM-8k, 6) SPEEX→GSM-8k→ADPCM.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

8 acoustic models using singleencoding-decoding.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

TIMIT database. Sampling rate:8kHz.

Training set: 462 speakers with3696 utterances.

Development Set: 50 speakers with400 utterances.

Test Set: 24 speakers with 192utterances.

Experiments

Datasets

SPIRE LAB, IISc, Bangalore 15

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

6 Tandem test databases: 1) ADPCM→GSM-8k→SPEEX, 2) ADPCM→SPEEX→GSM-8k,3) GSM-8k→ADPCM→SPEEX, 4) GSM-8k→SPEEX→ADPCM, 5)SPEEX→ADPCM→GSM-8k, 6) SPEEX→GSM-8k→ADPCM.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

8 acoustic models using singleencoding-decoding.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

TIMIT database. Sampling rate:8kHz.

Training set: 462 speakers with3696 utterances.

Development Set: 50 speakers with400 utterances.

Test Set: 24 speakers with 192utterances.

Experiments

Datasets

SPIRE LAB, IISc, Bangalore 15

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

6 Tandem test databases: 1) ADPCM→GSM-8k→SPEEX, 2) ADPCM→SPEEX→GSM-8k,3) GSM-8k→ADPCM→SPEEX, 4) GSM-8k→SPEEX→ADPCM, 5)SPEEX→ADPCM→GSM-8k, 6) SPEEX→GSM-8k→ADPCM.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

8 acoustic models using singleencoding-decoding.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

TIMIT database. Sampling rate:8kHz.

Training set: 462 speakers with3696 utterances.

Development Set: 50 speakers with400 utterances.

Test Set: 24 speakers with 192utterances.

Experiments

Datasets

SPIRE LAB, IISc, Bangalore 15

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

Codec Type Band-width

Bit-rate(kbps)

G.711A Waveform Narrow 64MELP Parametric Narrow 2.4AMR-NB

Hybrid Narrow 4.40

AMR-WB

Hybrid Wide 23.85

G.728 Hybrid Narrow 16G.729A Hybrid Narrow 8G.729B Hybrid Narrow 8PCM Waveform Narrow 128ADPCM Waveform Wide 32GSM-8k

Hybrid Narrow 13

SPEEX Hybrid Wide 27.8

6 Tandem test databases: 1) ADPCM→GSM-8k→SPEEX, 2) ADPCM→SPEEX→GSM-8k,3) GSM-8k→ADPCM→SPEEX, 4) GSM-8k→SPEEX→ADPCM, 5)SPEEX→ADPCM→GSM-8k, 6) SPEEX→GSM-8k→ADPCM.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

8 acoustic models using singleencoding-decoding.

CODEC 1

CODEC 1CODEC N

CODEC 2

CODEC 1

CODEC 2

CODEC 2 CODEC N

CODEC 1 CODEC 2 CODEC 3

TIMIT database. Sampling rate:8kHz.

Training set: 462 speakers with3696 utterances.

Development Set: 50 speakers with400 utterances.

Test Set: 24 speakers with 192utterances.

Experiments

Overview of Experiments: Single Encoding Decoding

SPIRE LAB, IISc, Bangalore 16

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

EVALUATE THE PERFORMANCE OF THE SELECTED TOP ACOUSTIC MODELS

? ? ? ? ?TOP ACOUSTIC MODELS

PCM

PCM

PCM

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

✓FIND THE TOP ACOUSTIC MODELS FROM 8

? ? ? ? ?TOP ACOUSTIC

MODELS

PCM

PCM

PCM

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

G.711A AMR-WB G.728 G.729A G.729B COCKTAIL

G.711A AMR-WB G.728 G.729A G.729B AMR-NB

ADPCMGSM-8KSPEEX

GSM-8KADPCMSPEEX

GSM-8KSPEEXADPCM

ADPCMSPEEX

GSM-8K

SPEEXADPCMGSM-8K

SPEEXGSM-8KADPCM

6TRAINED

ACOUSTIC MODELS

6 BLINDTEST SETS

PCM

PCM

PCM

Experiments

Overview of Experiments: Single Encoding Decoding

SPIRE LAB, IISc, Bangalore 16

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

EVALUATE THE PERFORMANCE OF THE SELECTED TOP ACOUSTIC MODELS

? ? ? ? ?TOP ACOUSTIC MODELS

PCM

PCM

PCM

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

✓FIND THE TOP ACOUSTIC MODELS FROM 8

? ? ? ? ?TOP ACOUSTIC

MODELS

PCM

PCM

PCM

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

G.711A AMR-WB G.728 G.729A G.729B COCKTAIL

G.711A AMR-WB G.728 G.729A G.729B AMR-NB

ADPCMGSM-8KSPEEX

GSM-8KADPCMSPEEX

GSM-8KSPEEXADPCM

ADPCMSPEEX

GSM-8K

SPEEXADPCMGSM-8K

SPEEXGSM-8KADPCM

6TRAINED

ACOUSTIC MODELS

6 BLINDTEST SETS

PCM

PCM

PCM

Experiments

Overview of Experiments: Single Encoding Decoding

SPIRE LAB, IISc, Bangalore 16

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

EVALUATE THE PERFORMANCE OF THE SELECTED TOP ACOUSTIC MODELS

? ? ? ? ?TOP ACOUSTIC MODELS

PCM

PCM

PCM

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

✓FIND THE TOP ACOUSTIC MODELS FROM 8

? ? ? ? ?TOP ACOUSTIC

MODELS

PCM

PCM

PCM

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1 CODEC 1

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

G.711A AMR-WB G.728 G.729A G.729B COCKTAIL

G.711A AMR-WB G.728 G.729A G.729B AMR-NB

ADPCMGSM-8KSPEEX

GSM-8KADPCMSPEEX

GSM-8KSPEEXADPCM

ADPCMSPEEX

GSM-8K

SPEEXADPCMGSM-8K

SPEEXGSM-8KADPCM

6TRAINED

ACOUSTIC MODELS

6 BLINDTEST SETS

PCM

PCM

PCM

Experiments

Overview of Experiments: Tandem Encoding Decoding

SPIRE LAB, IISc, Bangalore 17

COCKTAIL

ADPCMGSM-8KSPEEX

GSM-8KADPCMSPEEX

GSM-8KSPEEXADPCM

ADPCMSPEEX

GSM-8K

SPEEXADPCMGSM-8K

SPEEXGSM-8KADPCM

TRAINED ACOUSTIC MODELS

6 BLINDTEST SETS✓

EVALUATE THE PERFORMANCE OF THE SELECTED TOP

ACOUSTIC MODELS+COCKTAIL

MODEL

? ? ? ? ?

Results

Section 4

1 Introduction

2 Previous Works

3 Experiments

4 Results

5 Conclusion

SPIRE LAB, IISc, Bangalore 18

Results

Single Encoding Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

Question

What are the best acoustic models across all the AMTs for variouscoded speech?

8 Candidate Models: G.711A, MELP, AMR-NB, AMR-WB, G.728,G.729A, G.729B, PCM.

8 development and 8 test datasets.

SPIRE LAB, IISc, Bangalore 19

Results

Single Encoding Decoding: Choice of Top codecs

SPIRE LAB, IISc, Bangalore 20

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

✓FIND THE TOP ACOUSTIC MODELS FROM 8

? ? ? ? ?TOP ACOUSTIC

MODELS

PCM

PCM

PCM

Results

Single Encoding Decoding: Choice of Top codecs

MONO CD−TRI SGMM DNN−DP DNN−DP−sMBR15

20

25

30

35

40

45

50P

E R

(%

)

AMR−NB AMR−WB PCM G.711A G.728 G.729A G.729B MELP MATCHED

The average (standard deviation) PER (%) for 8 acoustic models and 5 AMTs across the

development sets.

ResultsPER decreases with the improvements in the AMTs.

Matched condition performs best across all the AMTs.

SPIRE LAB, IISc, Bangalore 21

Results

Single Encoding Decoding: Choice of Top codecs

MONO CD−TRI SGMM DNN−DP DNN−DP−sMBR15

20

25

30

35

40

45

50P

E R

(%

)

AMR−NB AMR−WB PCM G.711A G.728 G.729A G.729B MELP MATCHED

The average (standard deviation) PER (%) for 8 acoustic models and 5 AMTs across the

development sets.

ResultsPER decreases with the improvements in the AMTs.

Matched condition performs best across all the AMTs.

SPIRE LAB, IISc, Bangalore 21

Results

Single Encoding Decoding: Choice of Top codecs

AMR−WB PCM G.711A G.728 G.729A G.729B0

2

4

Codec

Count

Histogram of top four ranked codecs across different AMTs.

Results

Higher bit rate codecs.

Most of them are narrowband codecs.

SPIRE LAB, IISc, Bangalore 22

Results

Single Encoding Decoding: Choice of Top codecs

AMR−WB PCM G.711A G.728 G.729A G.729B0

2

4

Codec

Count

Histogram of top four ranked codecs across different AMTs.

Results

Higher bit rate codecs.

Most of them are narrowband codecs.

SPIRE LAB, IISc, Bangalore 22

Results

Single Encoding Decoding: Performance of top codecs

SPIRE LAB, IISc, Bangalore 23

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

G.711A AMR-WB G.728 G.729A G.729B AMR-NB MELP

8TRAINED

ACOUSTIC MODELS

8DEVELOPMENT

SETS

8TEST SETS

EVALUATE THE PERFORMANCE OF THE SELECTED TOP ACOUSTIC MODELS

G.711A AMR-WB G.728 G.729A G.729BTOP ACOUSTIC MODELS

PCM

PCM

PCM

Results

Single Encoding Decoding: Performance of top codecs

MONO CD−TRI SGMM DNN−DP DNN−DP−sMBR15

20

25

30

35

40

45

50P

E R

(%

)

PCM G.729A AMR−WB G.728 G.729B G.711A MATCHED

The average (standard deviation) PER (%) for the top 5 acoustic models (along with PCM and

Mixed) and 5 AMTs across the test sets

ResultsPER decreases with the improvements in the AMTs.

Least PER for G.711A based acoustic model.

SPIRE LAB, IISc, Bangalore 24

Results

Single Encoding Decoding: Performance of top codecs

MONO CD−TRI SGMM DNN−DP DNN−DP−sMBR15

20

25

30

35

40

45

50P

E R

(%

)

PCM G.729A AMR−WB G.728 G.729B G.711A MATCHED

The average (standard deviation) PER (%) for the top 5 acoustic models (along with PCM and

Mixed) and 5 AMTs across the test sets

ResultsPER decreases with the improvements in the AMTs.

Least PER for G.711A based acoustic model.

SPIRE LAB, IISc, Bangalore 24

Results

Tandem Encoding Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

2C

3

Tandem Encoding-Decoding

Question

How do the top five acoustic models perform across all the AMTs fortandem coded speech?

6 Candidate models: G.711A, AMR-WB, G.728, G.729A, G.729B,Cocktail.

6 blind test sets: Combinations of ADPCM, GSM-8k, SPEEX.

SPIRE LAB, IISc, Bangalore 25

Results

Tandem Encoding Decoding: Performance of top codecs

SPIRE LAB, IISc, Bangalore 26

COCKTAIL

ADPCMGSM-8KSPEEX

GSM-8KADPCMSPEEX

GSM-8KSPEEXADPCM

ADPCMSPEEX

GSM-8K

SPEEXADPCMGSM-8K

SPEEXGSM-8KADPCM

TRAINED ACOUSTIC MODELS

6 BLINDTEST SETS✓

EVALUATE THE PERFORMANCE OF THE SELECTED TOP

ACOUSTIC MODELS+COCKTAIL

MODEL

G.711A AMR-WB G.728 G.729A G.729B

Results

Tandem Encoding Decoding: Performance of top codecs

MONO CD−TRI SGMM DNN−DP DNN−DP−sMBR20

25

30

35

40

45

50

55

60

P E

R (

%)

AMR−WB G.729B G.728 G.729A G.711A COCKTAIL MATCHED

The average (standard deviation) PER (%) for 6 acoustic models and 5 AMTs across six blind

test sets

ResultsPER decreases with the improvements in the AMTs.

Least PER for G.711A based acoustic model.

Cocktail acoustic model is comparable to the matched condition.

SPIRE LAB, IISc, Bangalore 27

Results

Tandem Encoding Decoding: Performance of top codecs

MONO CD−TRI SGMM DNN−DP DNN−DP−sMBR20

25

30

35

40

45

50

55

60

P E

R (

%)

AMR−WB G.729B G.728 G.729A G.711A COCKTAIL MATCHED

The average (standard deviation) PER (%) for 6 acoustic models and 5 AMTs across six blind

test sets

ResultsPER decreases with the improvements in the AMTs.

Least PER for G.711A based acoustic model.

Cocktail acoustic model is comparable to the matched condition.

SPIRE LAB, IISc, Bangalore 27

Conclusion

Section 5

1 Introduction

2 Previous Works

3 Experiments

4 Results

5 Conclusion

SPIRE LAB, IISc, Bangalore 28

Conclusion

Key Finding 1

SPIRE LAB, IISc, Bangalore 29

NarrowbandHigh bit-rate

codec

G.711A

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

Conclusion

Key Finding 1

SPIRE LAB, IISc, Bangalore 29

NarrowbandHigh bit-rate

codec

G.711A

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

N

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C

ACOUSTIC MODEL BUILT ON

DATA

?

Single Encoding-Decoding

Conclusion

Key Finding 2

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

2C

3

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

ACOUSTIC MODEL BUILT ON

DATA

C1

C3

C2

CN

COCKTAIL ACOUSTIC MODEL

C1

C2

C3

Cocktail Acoustic Model

SPIRE LAB, IISc, Bangalore 30

Conclusion

Key Finding 2

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

C1

ACOUSTIC MODEL BUILT ON

DATA

?C

2C

3

Tandem Encoding-Decoding

FEATUREEXTRACTION

RECOGNIZEDPHONEME

CLEAN/ DISTORTED

DUE TO CODECS

ACOUSTIC MODEL BUILT ON

DATA

C1

C3

C2

CN

COCKTAIL ACOUSTIC MODEL

C1

C2

C3

Cocktail Acoustic Model

SPIRE LAB, IISc, Bangalore 30

Conclusion

Summary

Conclusions

1 Studied the codec induced distortion on the ASR performance.

2 G.711A, a narrowband high bit rate codec, results in the bestASR accuracy.

3 If the pool of tandem topologies are known a priori, cocktailacoustic model could be used.

Future works

1 Effectiveness of the best performing models along with languagemodels.

2 Compensation of the codec induced distortions to aid ASR.

SPIRE LAB, IISc, Bangalore 31

Conclusion

Summary

Conclusions

1 Studied the codec induced distortion on the ASR performance.

2 G.711A, a narrowband high bit rate codec, results in the bestASR accuracy.

3 If the pool of tandem topologies are known a priori, cocktailacoustic model could be used.

Future works

1 Effectiveness of the best performing models along with languagemodels.

2 Compensation of the codec induced distortions to aid ASR.

SPIRE LAB, IISc, Bangalore 31

Conclusion

Summary

Conclusions

1 Studied the codec induced distortion on the ASR performance.

2 G.711A, a narrowband high bit rate codec, results in the bestASR accuracy.

3 If the pool of tandem topologies are known a priori, cocktailacoustic model could be used.

Future works

1 Effectiveness of the best performing models along with languagemodels.

2 Compensation of the codec induced distortions to aid ASR.

SPIRE LAB, IISc, Bangalore 31

Conclusion

Summary

Conclusions

1 Studied the codec induced distortion on the ASR performance.

2 G.711A, a narrowband high bit rate codec, results in the bestASR accuracy.

3 If the pool of tandem topologies are known a priori, cocktailacoustic model could be used.

Future works

1 Effectiveness of the best performing models along with languagemodels.

2 Compensation of the codec induced distortions to aid ASR.

SPIRE LAB, IISc, Bangalore 31

Conclusion

Summary

Conclusions

1 Studied the codec induced distortion on the ASR performance.

2 G.711A, a narrowband high bit rate codec, results in the bestASR accuracy.

3 If the pool of tandem topologies are known a priori, cocktailacoustic model could be used.

Future works

1 Effectiveness of the best performing models along with languagemodels.

2 Compensation of the codec induced distortions to aid ASR.

SPIRE LAB, IISc, Bangalore 31

Conclusion

THANK YOU

SPIRE LAB, IISc, Bangalore 32

Recommended