Detection of Acoustic Landmarks with High Resolution for Speech Processing

A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, pcpandey, vinod}@ee.iitb.ac.in
EE Dept, IIT Bombay
3rd February, 2008

14th National Conference on Communications, 1-3 Feb. 2008, IIT Bombay, Mumbai, India




PRESENTATION OUTLINE

1. Introduction
   ▪ Acoustic properties of clear speech
   ▪ Landmark detection
   ▪ Need for high time resolution

2. Automated landmark detection with high resolution
   ▪ Pass 1
   ▪ Pass 2

3. Experimental results

4. Summary and conclusion


1. INTRODUCTION

Acoustic properties of clear speech

Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments

Examples (conversational and clear versions): http://www.acoustics.org/press/145th/clr-spch-tab.htm

▪ ‘the book tells a story’
▪ ‘the boy forgot his book’

Intelligibility of clear speech

▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners and listening conditions


Acoustic differences between clear and conversational speech

Sentence level
▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
▪ Larger variation in fundamental frequency
▪ More pauses, with longer pause durations

Word level
▪ Fewer sound deletions
▪ More sound insertions

Phonetic level
▪ Context-dependent, non-linear increase in segment durations
▪ More targeted vowel formants
▪ Increase in consonant intensity


Improvement in intelligibility of conversational speech by incorporating properties of clear speech

▪ Consonant-vowel intensity ratio (CVR) enhancement: increasing the energy of the consonant segment
▪ Consonant duration enhancement: lengthening CV and VC transitions (burst duration, VOT, formant transition)

Challenges

▪ Accurate detection of regions for modification
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without altering the overall speaking rate: an increase in transition regions with a corresponding decrease in steady-state segments


Intelligibility enhancement using properties of clear speech

Hazan & Simpson, 1998
▪ manually labeled VCV syllables and sentences
▪ intensity modification: stop burst +12 dB, frication +6 dB, nasal +6 dB
▪ spectral modification by filtering

Colotte & Laprie, 2000
▪ automated method for identifying regions, based on mel-cepstral analysis
▪ stops and unvoiced fricatives amplified by +4 dB
▪ transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)


Landmark detection

Speech landmarks
▪ Regions containing important information for speech perception
▪ Associated with spectral transitions

Landmark types

1. Abrupt-consonantal (AC): tight constrictions of primary articulators
2. Abrupt (A): fast glottal or velum activity
3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V): vowel landmarks

Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)


Landmarks


Liu, 1996

▪ Based on energy variation in 6 spectral bands: 0-0.4, 0.8-1.5, 1.2-2.0, 2.0-3.5, 3.5-5.0, 5.0-8 kHz
▪ Parameter: first difference of maximum energy (log) in each spectral band; time step = 50 ms at the coarse level, 26 ms at the fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal, sonorant (closures, releases), and stop (closures, releases) landmarks

Application: extraction of features for supporting speech recognition


Detection rate vs. temporal resolution

[Figure: detection rate vs. temporal resolution; values marked on the plot: 73 %, 83 %, 88 %, 44 %]

Uses the same processing for all types of landmarks


Niyogi & Sondhi, 2002
▪ for stop consonants
▪ total energy and energy above 3 kHz, in log scale
▪ measure of spectral flatness
▪ non-linear operator optimized for burst detection

Salomon et al., 2002
▪ Hilbert transform based envelope to extract temporal parameters
▪ spectral information
▪ adaptive time steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)

Alani & Deriche, 1999
▪ wavelet transform based decomposition
▪ energy variations in 6 bands


Need for high temporal resolution and detection rate

Application dependent

Speech recognition: analysis is performed around landmarks for parameter extraction
▪ high accuracy
▪ moderate temporal resolution (20-30 ms)

Intelligibility enhancement: modify landmark regions
▪ high temporal resolution (< 5 ms)
▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions

Landmark type
▪ Short-duration events (bursts) need high time resolution
▪ Voicing onsets/offsets may not require this much resolution, as signal properties remain the same for a long duration


Factors limiting detection rate and temporal resolution

▪ Effectiveness of parameters in capturing acoustic variations
   ▪ short-time energy variation in spectral bands: weak bursts may not get detected
   ▪ centroid frequency: not well defined during low-energy segments
   ▪ fixed band boundaries: may not adapt to speech variability
▪ Smoothing performed during parameter extraction
   ▪ temporal smoothing of the spectrum affects time resolution
▪ Type of distance measure
   ▪ first-difference operation not optimized for all types of landmarks
   ▪ time step of 10 ms is too large for burst detection
▪ Effect of noise on parameters


Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane

▪ Separate detectors are required for each phonetic class
▪ Each detector must use the method best suited to the phonetic event

Objective

Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement


2. AUTOMATED LANDMARK DETECTION

[Block diagram of the two-pass method]

Pass 1: speech → short-time spectral analysis → computation of energy peaks and centroids in the frequency bands → computation of energy and centroid RORs → computation of spectral transition index → landmark localization → landmarks (pass 1)

Pass 2: landmarks (pass 1) → wavelet decomposition around landmarks → computation of short-time energy and ZCRs → computation of energy and ZCR RORs → landmark localization → landmarks (pass 2)


Landmark detection using spectral peaks and centroids

Pass 1

Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ one frame every 1 ms

Parameters
▪ maximum energy in each spectral band, every 1 ms
▪ band centroid estimated in each band, every 1 ms
▪ features similar to formant peaks and formant frequencies
▪ can be estimated easily
▪ not much affected by noise
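The two Pass 1 parameters can be illustrated for a single frame with the rough sketch below. It assumes the band edges and the 10 kHz / 512-point settings listed on the slide; the function names, the direct (library-free) DFT, the absence of a window function, and the small constant guarding the logarithm are all choices made for this illustration, not details taken from the presentation.

```python
import cmath
import math

# Band edges in Hz as listed on the slide; everything else is illustrative.
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]


def power_spectrum(frame, nfft=512):
    """Zero-padded DFT power |X(k)|^2 for k = 0 .. nfft/2 (direct DFT, no FFT library)."""
    power = []
    for k in range(nfft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * m / nfft)
                  for m, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    return power


def band_parameters(frame, fs=10000, nfft=512, bands=BANDS_HZ):
    """Return one (peak energy in dB, centroid in Hz) pair per band."""
    power = power_spectrum(frame, nfft)
    params = []
    for lo_hz, hi_hz in bands:
        # First and last DFT bins that fall inside the band.
        k1 = math.ceil(lo_hz * nfft / fs)
        k2 = min(math.floor(hi_hz * nfft / fs), nfft // 2)
        band = power[k1:k2 + 1]
        # Peak energy: largest power value in the band, on a log scale.
        peak_db = 10.0 * math.log10(max(band) + 1e-12)
        # Centroid: power-weighted mean bin index, converted to Hz.
        weighted = sum(k * p for k, p in zip(range(k1, k2 + 1), band))
        centroid_hz = weighted / (sum(band) + 1e-12) * fs / nfft
        params.append((peak_db, centroid_hz))
    return params
```

For a 6 ms frame (60 samples at 10 kHz) of an 800 Hz sinusoid, the largest peak energy should fall in the 0.4–1.2 kHz band, with that band's centroid close to 800 Hz. Calling `band_parameters` once per 1 ms frame shift would give the per-band tracks used in the later steps.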


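Peak energy

\[ E_p(b, n) = 10 \log_{10} \Big( \max_{k_1 \le k \le k_2} |X_n(k)|^2 \Big) \]

Centroid frequency

\[ f_c(b, n) = \frac{f_s}{N} \cdot \frac{\sum_{k=k_1}^{k_2} k \, |X_n(k)|^2}{\sum_{k=k_1}^{k_2} |X_n(k)|^2} \]

where X_n(k) is the N-point DFT of frame n, k_1 and k_2 are the first and last DFT indices of band b, and f_s is the sampling frequency.

Rate-of-rise functions

\[ E'_p(b, n) = E_p(b, n + K) - E_p(b, n - K) \]

\[ f'_c(b, n) = f_c(b, n + K) - f_c(b, n - K) \]

Transition index

\[ T_r(n) = \sum_{b=1}^{5} E'_p(b, n) \, f'_c(b, n) \]

▪ tracks simultaneous variation of energy and centroid
▪ centroids given less weighting in low-energy areas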
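A minimal sketch of the rate-of-rise and transition-index computation follows. It assumes the ROR is the symmetric difference x(n + K) − x(n − K) and that the transition index sums the per-band product of the energy ROR and the centroid ROR over the five bands; the product form is inferred from the remark that centroids get less weight in low-energy areas and should be treated as an assumption. The function names and the clamping of indices at the track edges are illustrative only.

```python
def rate_of_rise(track, K):
    """Symmetric difference track[n + K] - track[n - K].

    Indices are clamped at the ends of the track (an assumption of this sketch).
    """
    last = len(track) - 1
    return [track[min(n + K, last)] - track[max(n - K, 0)]
            for n in range(len(track))]


def transition_index(peak_db, centroid_hz, K):
    """Sum over bands of (energy ROR) x (centroid ROR).

    peak_db[b][n] and centroid_hz[b][n] are per-band tracks, one value per frame.
    """
    n_frames = len(peak_db[0])
    index = [0.0] * n_frames
    for e_track, f_track in zip(peak_db, centroid_hz):
        e_ror = rate_of_rise(e_track, K)
        f_ror = rate_of_rise(f_track, K)
        for n in range(n_frames):
            # A band contributes only where energy and centroid change together.
            index[n] += e_ror[n] * f_ror[n]
    return index
```

With this form, a band whose energy is flat contributes nothing to the index even if its centroid jumps, which is one way to read the "less weighting in low-energy areas" remark.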


Example: /uka/

[Figure: peak and centroid contours in the five bands (0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz)]


Example: /uka/

[Figure: peak and centroid ROR contours in the five bands (0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz), time step = 26 ms]


Example: /uka/

Transition index

derived from RORs with time step = 26 ms


Example: /uka/

Transition index

derived from RORs with time step = 4 ms

Less sensitive to slow transitions


Problems

Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ also detects slowly varying events (higher detection rate)

Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions

Pass 2: analyze the landmarks detected in Pass 1 with a small time step
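The trade-off can be seen on a toy energy track. The sketch below is illustrative only: it builds a synthetic track (one frame per ms) with a slow 20 dB ramp and an abrupt 20 dB step, and assumes the time step corresponds to the 2K-frame span of a symmetric difference x(n + K) − x(n − K). The track, the 9 dB threshold, and the function names are hypothetical.

```python
def rate_of_rise(track, K):
    """Symmetric difference track[n + K] - track[n - K], clamped at the edges."""
    last = len(track) - 1
    return [track[min(n + K, last)] - track[max(n - K, 0)]
            for n in range(len(track))]


def frames_above(track, K, threshold):
    """Frames whose rate of rise reaches the threshold in magnitude."""
    return [n for n, v in enumerate(rate_of_rise(track, K)) if abs(v) >= threshold]


def demo_track():
    """Synthetic energy track in dB: slow ramp over frames 20-60, abrupt step at frame 100."""
    track = []
    for n in range(150):
        if n < 20:
            level = 0.0
        elif n < 60:
            level = 0.5 * (n - 20)   # 20 dB rise spread over 40 ms
        elif n < 100:
            level = 20.0
        else:
            level = 40.0             # 20 dB rise within one frame
        track.append(level)
    return track


track = demo_track()
fine = frames_above(track, K=2, threshold=9.0)     # 4 ms span
coarse = frames_above(track, K=13, threshold=9.0)  # 26 ms span
```

Here `fine` contains only the four frames around the step (98 to 101) and misses the ramp entirely, while `coarse` flags the ramp as well but smears the step over 26 frames (87 to 112). This is the behaviour that motivates detecting with a large step first and refining with a small one.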


Improving temporal resolution: Pass 2

▪ 40 ms window centered around burst landmarks detected in pass 1
▪ decomposed to 6 levels by discrete Meyer wavelet
▪ detail (high-frequency) contents in the lower two levels used for localizing bursts

Parameters
▪ short-time energy variation
▪ zero-crossing rate

Compute normalized RORs with a time step of 3 ms

Get a new transition index as

\[ T_{ez}(n) = 0.5 \sum_{l=1}^{2} E'_n(l, n) \, Z'_n(l, n) \]

Relocate the landmark to the location corresponding to the peak in T_ez(n)
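The relocation step might be sketched as follows. This is not the authors' implementation: the wavelet decomposition is omitted, so `details` stands in for two detail signals assumed to be available at the original sampling rate; the combination of normalized energy and ZCR RORs as a product, the 1 ms analysis window, the 15-sample half-step (a 3 ms span at 10 kHz), and the plateau-midpoint rule for picking the peak are all assumptions made for illustration.

```python
def short_time_energy_and_zcr(x, win):
    """Sliding-window energy and zero-crossing count, one value per window start."""
    energy, zcr = [], []
    for n in range(len(x) - win + 1):
        seg = x[n:n + win]
        energy.append(sum(v * v for v in seg))
        zcr.append(sum(1 for a, b in zip(seg, seg[1:]) if a * b < 0))
    return energy, zcr


def normalized_ror(track, K):
    """Symmetric difference scaled so that its largest magnitude is 1."""
    last = len(track) - 1
    ror = [track[min(n + K, last)] - track[max(n - K, 0)]
           for n in range(len(track))]
    scale = max(abs(v) for v in ror) or 1.0
    return [v / scale for v in ror]


def transition_index_ez(details, win, K):
    """0.5 * sum over the detail signals of (energy ROR) x (ZCR ROR)."""
    n_out = len(details[0]) - win + 1
    t_ez = [0.0] * n_out
    for d in details:
        energy, zcr = short_time_energy_and_zcr(d, win)
        e_ror = normalized_ror(energy, K)
        z_ror = normalized_ror(zcr, K)
        for n in range(n_out):
            t_ez[n] += 0.5 * e_ror[n] * z_ror[n]
    return t_ez


def relocate(t_ez, win):
    """Sample index of the T_ez peak: middle of the maximal plateau plus half a window."""
    top = max(t_ez)
    plateau = [n for n, v in enumerate(t_ez) if top - v < 1e-9]
    return (plateau[0] + plateau[-1]) // 2 + win // 2


# Toy input: 20 ms of silence followed by a noise-like burst starting at sample 200.
burst = [0.0] * 200 + [1.0 if m % 2 == 0 else -1.0 for m in range(200)]
onset = relocate(transition_index_ez([burst, burst], win=10, K=15), win=10)
```

On this toy input `onset` comes out at sample 200, the true start of the burst. Real detail signals would give a less clean peak, and the half-window offset would need to match however the analysis windows are aligned.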


Relocating stop landmarks



3. EXPERIMENTAL RESULTS

Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ Total 54 tokens

Pass 1 (columns a, i, u: initial vowel)

Stop      30 ms      20 ms      10 ms      5 ms
          a  i  u    a  i  u    a  i  u    a  i  u
/p/       -  -  -    -  -  -    -  -  -    1  1  2
/t/       -  -  -    -  -  -    -  -  -    1  1  2
/k/       -  -  -    1  -  -    1  -  1    3  3  3
Det. %    100        98.1       96.3       68.5

Pass 2 (columns a, i, u: initial vowel)

Stop      10 ms      7 ms       5 ms       3 ms
          a  i  u    a  i  u    a  i  u    a  i  u
/p/       -  -  -    -  -  -    -  -  -    -  -  -
/t/       -  -  -    -  -  -    -  -  -    -  1  -
/k/       -  -  -    -  -  -    -  -  -    -  -  -
Det. %    100        100        100        98.1
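The detection percentages in the two VCV tables are consistent with reading each table entry as the number of tokens (out of 54) not detected within the stated resolution. That reading is an assumption, but the arithmetic below reproduces every Det. % value on the slide. The dictionary and function names are illustrative.

```python
TOTAL_TOKENS = 54

# Column sums of the table entries, keyed by resolution in ms.
# Assumption: each entry counts tokens missed at that resolution.
PASS1_MISSES = {30: 0, 20: 1, 10: 2, 5: 17}
PASS2_MISSES = {10: 0, 7: 0, 5: 0, 3: 1}


def detection_rate(misses, total=TOTAL_TOKENS):
    """Percentage of tokens detected, rounded to one decimal place."""
    return round(100.0 * (total - misses) / total, 1)


pass1_rates = {res: detection_rate(m) for res, m in PASS1_MISSES.items()}
pass2_rates = {res: detection_rate(m) for res, m in PASS2_MISSES.items()}
```

At 5 ms resolution this gives 68.5 % for Pass 1 against 100 % for Pass 2, a gain of about 31.5 percentage points.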


Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens

Detection rates (%)

Phoneme class      30 ms             20 ms             10 ms
                   Pass 1  Pass 2    Pass 1  Pass 2    Pass 1  Pass 2
Stop (548)         94      96        82      86        62      66
Fricative (266)    95      95        90      90        76      79
Nasal (154)        80      79        70      70        53      51
Vowel (614)        77      79        70      71        58      57
S. vowel (213)     69      70        68      67        60      61
Overall det. (%)   84.1    85.7      76.4    78.0      61.7    63.0

[Figure: localization error]
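The overall detection row appears to be a count-weighted average of the per-class rates, taking the numbers in parentheses as the per-class counts. That interpretation is an assumption; the sketch below recomputes the averages from the rounded per-class percentages, so small differences from the slide's overall values (a few tenths of a percent) are expected. All names are illustrative.

```python
# Counts in parentheses on the slide, assumed to be the number of items per class.
CLASS_COUNTS = {"Stop": 548, "Fricative": 266, "Nasal": 154, "Vowel": 614, "S. vowel": 213}

# Detection rates (%) per resolution in ms, as (Pass 1, Pass 2) pairs.
RATES = {
    30: {"Stop": (94, 96), "Fricative": (95, 95), "Nasal": (80, 79),
         "Vowel": (77, 79), "S. vowel": (69, 70)},
    20: {"Stop": (82, 86), "Fricative": (90, 90), "Nasal": (70, 70),
         "Vowel": (70, 71), "S. vowel": (68, 67)},
    10: {"Stop": (62, 66), "Fricative": (76, 79), "Nasal": (53, 51),
         "Vowel": (58, 57), "S. vowel": (60, 61)},
}


def overall_rate(resolution_ms, pass_number):
    """Count-weighted mean detection rate for one resolution and pass (1 or 2)."""
    total = sum(CLASS_COUNTS.values())
    weighted = sum(CLASS_COUNTS[c] * RATES[resolution_ms][c][pass_number - 1]
                   for c in CLASS_COUNTS)
    return weighted / total
```

For example, `overall_rate(30, 1)` and `overall_rate(10, 2)` land within rounding distance of the 84.1 and 63.0 shown on the slide.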


4. SUMMARY & CONCLUSION

Pass 2 improves temporal resolution of stop landmarks

▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement at 5 ms resolution
▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution

Possible reasons
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may be above 30 ms
▪ use of a 40 ms window in Pass 2 may need modification
▪ errors in the manual labels

Future work: evaluation of the method in the presence of noise