Vocal Technologies - mica.edu.vn

International Research Institute MICA: Multimedia, Information, Communication & Applications

UMI 2954

Hanoi University of Science and Technology

1 Dai Co Viet - Hanoi - Vietnam

Vocal Technologies: From sound… to multilingualism

Pr. Eric Castelli

February, 2013

Part 2

Analysis of Speech signal

MICA

2013

Content

Fast introduction

Basics of signal processing
- Signals, systems, Fourier transform, spectra...

Analysis of speech signal
- Phonemes, spectra, formants, sonagrams...

Speech synthesis

Speech signal production
- Vocal cords, vocal tract, modeling...

Speech signal coding

Automatic speech recognition

Basics of speech databases


Speech

Speech = information source

Linguistic information:

- what is pronounced by the speaker
- contains, of course, semantic information

Extra-linguistic information:

- speaker identity
- language
- physiological and emotional state of the speaker: emotions, stress, illness


Speech processing

Speech can be distinguished from other sounds by its own acoustic characteristics.

Speech sounds are produced by two different processes:

- Vibration of the vocal cords: voiced source (pseudo-periodic source)
- Turbulence created by air flow, either passing quickly through a vocal tract constriction or released when a vocal tract occlusion opens: noise source (pseudo-random source)


Phonemes

For a language, the main function of sounds is to establish distinctions between the significant units.

Phonemes are the shortest acoustic elements which allow us to distinguish different words.

Examples: [p] / [b] (in French)

- pas / bas (not / low)
- paie / baie (wages / berry or bay)
- pot / beau (pot / beautiful)


French phonemes


Vietnamese language structure

Vietnamese:

- tonal language (like Mandarin, Cantonese and Thai)
- every syllable carries one tone

Syllable   Vietnamese   English
ma1        ma           ghost
ma2        mà           however
ma3        mã           horse
ma4        mả           tomb
ma5        má           cheek
ma6        mạ           seedling

Syllable structure

A syllable is made of an initial part and a final part, with the tone carried inside the syllable.

Part           Element           Status       Inventory
Initial part   initial sound     optional     21 consonants
Final part     pre-tonal sound   optional     1 semi-vowel
Final part     vowel             obligatory   11 vowels + 4 diphthongs + 1 triphthong
Final part     final sound       optional     6 consonants and 2 semi-vowels

Final part: 155 final parts in total
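As a small illustration of "every syllable carries one tone", the sketch below separates the tone mark from a written Vietnamese syllable using Unicode decomposition. It is only a sketch working on spelling, not on the speech signal; the function name `split_tone` and the mapping table are assumptions made for this example, with tone numbers following the ma1 to ma6 numbering of this slide.

```python
import unicodedata

# Combining tone marks mapped to the tone numbers used on this slide
# (tone 1 is the unmarked, flat tone). This table is an illustrative assumption.
TONE_MARKS = {
    "\u0300": 2,  # grave accent: huyền
    "\u0303": 3,  # tilde: ngã
    "\u0309": 4,  # hook above: hỏi
    "\u0301": 5,  # acute accent: sắc
    "\u0323": 6,  # dot below: nặng
}


def split_tone(syllable):
    """Return (syllable without its tone mark, tone number from 1 to 6)."""
    # NFD splits each letter into a base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = 1
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # vowel-quality marks (as in â, ô, ơ) are kept
    # NFC recomposes the remaining base letters and vowel-quality marks.
    return unicodedata.normalize("NFC", "".join(kept)), tone


for word in ["ma", "mà", "mã", "mả", "má", "mạ"]:
    print(word, split_tone(word))
```

The vowel-quality diacritics are deliberately left in place, so that only the tone is separated from the segmental content of the syllable.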


Speech signal

[Figure: speech signal of the French word "Bonsoir", with its successive sounds labelled]


Speech signal

Vous êtes Monsieur Gilbert Dupont n’est-ce pas ? ("You are Mr Gilbert Dupont, aren't you?")


Main difficulty

The main characteristic of the speech signal is VARIABILITY

- One person never pronounces the same sound twice in the same way.
- Two people do not produce the same speech signal for the same word.

However, the sound is always well recognized and well understood by humans.


Same speaker variability

Same sound, same speaker, same recording conditions


Variability between speakers

Same sound, same recording conditions, two different speakers


Variability due to recording conditions

Same sound, same speaker, two different microphones


Why analyze speech?

- To study and understand the physical phenomena involved
  - Be a little less ignorant...
  - Also understand malfunctions (speech and language impairments)
  - Be able to use this knowledge for learning foreign languages
- To reproduce artificial speech
  - Speech synthesis (formant synthesis, diphone synthesis, HMM synthesis, for instance)
  - Vocal tract modeling
- To determine characteristic measurements which can be used by automatic speech recognition engines
  - Spectral characteristics (LPC, MFCC, fundamental frequency, etc.)


Speech processing: analysis – why?

Acoustic and spectral analysis

- Nature of speech sounds in terms of frequencies
- Durations and timing constraints
- Energy
- Co-ordination and co-articulation
- etc.

Applications

- Understand the physical phenomena involved in speech production: position of the articulators, role of the parts of the vocal system (vocal cords, vocal tract, etc.)
- Speech perception: how do humans classify speech sounds?
- Define pertinent parameters
  - For speech synthesis: elementary parts of the signal, intonation, etc.
  - For speech recognition: find a pertinent parameter vector as input to the recognition engine
  - For speech coding: adapted filter (ADPCM, CELP, etc.) for efficient coding and good quality


Direct measurements on human subject

Vocal tract transfer function measurements (Castelli, 1989)


Direct measurements on human subject

Aerodynamic and acoustic measurements

Measurements of:
- Radiated pressure
- Flow at the lips
- High frequency energy
- Low frequency energy
- Fundamental frequency


Direct measurements...

Disadvantages:

- Difficult to carry out
- Require substantial equipment
- Difficult to reproduce
- No automatic procedures
- The human subject must be trained: no movement for a long time, or moving just ONE vocal tract articulator
- Some measurements are intrusive: the human subject has difficulty tolerating them, and they modify the "natural" conditions
- Direct measurement of some characteristics is simply impossible, for example glottal flow

JUST FOR FUNDAMENTAL RESEARCH


Speech analysis tools


Production model: source-filter concept (or exciter-resonator)

The speech production system (vocal tract) can be considered as a "speech instrument".

[Figure: analogy with a string instrument. The strings are the exciter (source) and the body is the resonator (filter).]



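[Diagram: source-filter model. Exciter (source): an impulse source (vocal cords, glottal signal over time t) and a noise source (occlusion), selected by commands and scaled by a gain. Resonator (filter): the vocal tract, whose response over frequency f shows the formants F1, F2 and F3. Output: speech.]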
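The source-filter idea can be tried numerically. The following is a minimal sketch, not the model used in the course: an impulse train stands in for the vocal cords and a cascade of second-order resonators stands in for the vocal tract. The function names, the formant frequencies and bandwidths (700, 1200 and 2500 Hz) and the sampling rate are illustrative assumptions.

```python
import numpy as np


def impulse_train(f0, fs, duration):
    """Exciter (source): one unit impulse per fundamental period."""
    x = np.zeros(int(round(fs * duration)))
    period = int(round(fs / f0))  # period in samples
    x[::period] = 1.0
    return x


def resonator(x, freq_hz, bandwidth_hz, fs):
    """Second-order all-pole resonator, used here as one formant."""
    r = np.exp(-np.pi * bandwidth_hz / fs)   # pole radius from the bandwidth
    theta = 2.0 * np.pi * freq_hz / fs       # pole angle from the frequency
    a1, a2 = -2.0 * r * np.cos(theta), r * r
    y = np.zeros(len(x))
    y1 = y2 = 0.0                            # y[n-1] and y[n-2]
    for n, xn in enumerate(x):
        yn = xn - a1 * y1 - a2 * y2
        y[n] = yn
        y2, y1 = y1, yn
    return y


def synthesize_vowel(f0=100.0,
                     formants=((700.0, 80.0), (1200.0, 90.0), (2500.0, 120.0)),
                     fs=16000, duration=0.3):
    """Resonator (filter): pass the source through one resonator per formant.

    The (frequency, bandwidth) pairs are hypothetical values chosen for the example.
    """
    y = impulse_train(f0, fs, duration)
    for freq_hz, bandwidth_hz in formants:
        y = resonator(y, freq_hz, bandwidth_hz, fs)
    return y / np.max(np.abs(y))             # normalize the amplitude to 1


signal = synthesize_vowel()
```

Changing `f0` alters only the spacing of the harmonics, while changing `formants` alters only the spectral envelope, which is the separation the diagram expresses.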


Formants

Formants characterize the resonance frequencies of the vocal tract.

[Figure: spectrum (energy versus frequency f) showing the formant peaks F1, F2 and F3, with the corresponding spectrogram]


Example in frequency domain: spectra & vowel formants [o]

Formants

English vowels /i/, /a/, /u/; French vowel /o/


Formants: vocalic triangle and dispersion ellipses

Dispersions are due to the large variability of speech (including variability between speakers)

For French

Acoustic triangle

Phonetic triangle


Formants: some values

For French


Formants & vocalic triangle for Vietnamese

[Figure: F1/F2 plane (F1 from 100 to 1100, F2 from 400 to 2400) and F2/F3 plane (F3 from 2000 to 3400, F2 from 400 to 2400), with the vowels labelled a, i, u, ee, e, o, oo, ow, uw. One male speaker (LA).]


Formants & vocalic triangle for Vietnamese

11 vowels: a, e, ê, i, u, o, ô, ơ, ư, â, ă

9 Vietnamese speakers (males)

[Figure: F2/F3 plane (F3 from 2000 to 3400, F2 from 400 to 2400) and F1/F2 plane (F1 from 100 to 1100, F2 from 400 to 2400), with the vowel regions labelled]


Spectrogram - 1


Spectrogram - 2


Spectrogram - 3



[Diagram: the source-filter model again, with exciter (source) and resonator (filter)]

1) Resonator (filter): already done

2) Exciter (source): next slides


Fundamental frequency estimation

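F0 = fundamental frequency (pitch)

It corresponds to the vibration frequency of the vocal cords.

Estimation methods:

- Zero-crossing or threshold-crossing
- Autocorrelation
- AMDF (Average Magnitude Difference Function): $\mathrm{Amdf}(n) = \sum_{k=0}^{N-1} |x(k) - x(k+n)|$
- Cepstrum: $C(x(t)) = \mathrm{IFFT}(\mathrm{Log}(\mathrm{Abs}(\mathrm{FFT}(w(t)\,x(t)))))$
- FFT and Dirac "comb": $P_r[k] = \sum_i \delta[k - i\,r]$, $F_P[r] = \sum_k P_r[k]\,\mathrm{FFT}_{x(t)}[k]$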
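The AMDF method listed above compares the signal with a delayed copy of itself and looks for the delay at which the two differ least. The sketch below is one possible reading of it, assuming a search range given by the hypothetical parameters `f0_min` and `f0_max`, and a fixed number of terms for every lag so that all lags are compared on the same footing.

```python
import numpy as np


def f0_amdf(x, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 as fs / (lag that minimizes the magnitude difference)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    lag_min = int(fs / f0_max)                # shortest period considered
    lag_max = min(int(fs / f0_min), n // 2)   # longest period considered
    m = n - lag_max                           # same number of terms for every lag
    amdf = np.array([np.sum(np.abs(x[:m] - x[lag:lag + m]))
                     for lag in range(lag_min, lag_max + 1)])
    best_lag = lag_min + int(np.argmin(amdf))
    return fs / best_lag


# Synthetic test signal (an assumption of this example): 125 Hz plus one harmonic.
fs = 16000
t = np.arange(640) / fs
x = np.sin(2 * np.pi * 125 * t) + 0.5 * np.sin(2 * np.pi * 250 * t)
print(f0_amdf(x, fs, f0_min=80.0))
```

The AMDF is also close to zero at every multiple of the period, so the search range matters: if it includes two periods, the estimate may fall one octave too low.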


Fundamental frequency estimation

[Figure: six panels: signal, Hamming window, signal after windowing, signal spectrum, signal autocorrelation, signal cepstrum]


Fundamental frequency estimation: autocorrelation

The autocorrelation function is given by the following formula:

$R_x(n) = \sum_{i=0}^{N} x(i) \cdot x(i+n)$

For a pseudo-periodic signal, this function shows regularly spaced maxima, separated by Fe/F0 samples (Fe being the sampling frequency).

We can detect the maxima, deduce the period and compute the fundamental frequency (= 1/period).
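The autocorrelation method can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the search range (`f0_min`, `f0_max`) and the mean removal are choices made for the example, and the sum simply runs over the samples available in the frame.

```python
import numpy as np


def f0_autocorrelation(x, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 from the lag of the largest autocorrelation maximum."""
    x = np.asarray(x, dtype=float)
    x = x - np.mean(x)                         # remove any DC offset
    n = len(x)
    lag_min = int(fs / f0_max)                 # skip the main peak at lag 0
    lag_max = min(int(fs / f0_min), n - 1)
    # Autocorrelation: sum of x(i) * x(i + lag) over the available samples.
    r = np.array([np.dot(x[:n - lag], x[lag:]) for lag in range(lag_max + 1)])
    best_lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / best_lag                       # period in samples -> frequency


# Synthetic test signal (an assumption of this example): 125 Hz plus one harmonic.
fs = 16000
t = np.arange(640) / fs
x = np.sin(2 * np.pi * 125 * t) + 0.5 * np.sin(2 * np.pi * 250 * t)
print(f0_autocorrelation(x, fs))
```

The best lag is the spacing between maxima in samples, so dividing the sampling frequency by it gives the fundamental frequency.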


Fundamental frequency estimation: zero-crossing

During vowel production, the speech signal has a pseudo-periodic form.

It is sufficient to set a threshold (zero-crossing or threshold-crossing) and count the number of crossings.

[Figure: pseudo-periodic signal with the threshold marked]
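Counting crossings can be sketched as follows. The function name and the `threshold` parameter are assumptions of this example, and the estimate is only meaningful when the signal crosses the threshold upward exactly once per period.

```python
import numpy as np


def f0_threshold_crossing(x, fs, threshold=0.0):
    """Estimate F0 as the number of upward threshold crossings per second."""
    x = np.asarray(x, dtype=float)
    above = x > threshold
    # An upward crossing: the sample is below or at the threshold, the next one is above.
    upward = np.count_nonzero(~above[:-1] & above[1:])
    duration = len(x) / fs
    return upward / duration


# Synthetic test signal (an assumption of this example): a pure 125 Hz sine, 0.2 s long.
fs = 16000
t = np.arange(3200) / fs
x = np.sin(2 * np.pi * 125 * t)
print(f0_threshold_crossing(x, fs, threshold=0.5))
```

A threshold above zero makes the count less sensitive to small oscillations around zero, at the cost of missing periods whose amplitude stays below it.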


Fundamental frequency estimation: cepstrum

The cepstrum is computed with the following formula:

$C(x(t)) = \mathrm{IFFT}(\mathrm{Log}(\mathrm{Abs}(\mathrm{FFT}(w(t)\,x(t)))))$

In the resulting signal, the peak corresponding to the source is clearly visible.

[Figure: cepstrum with the source peak marked]
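The cepstrum computation can be written almost literally with NumPy. The sketch below assumes a Hamming window for w(t), a small constant added before the logarithm to avoid log(0), and a search range set by the hypothetical parameters `f0_min` and `f0_max`.

```python
import numpy as np


def f0_cepstrum(x, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 from the position of the source peak in the cepstrum."""
    x = np.asarray(x, dtype=float)
    w = np.hamming(len(x))                              # w(t)
    magnitude = np.abs(np.fft.fft(w * x))               # Abs(FFT(w(t) x(t)))
    cepstrum = np.fft.ifft(np.log(magnitude + 1e-12)).real
    q_min = int(fs / f0_max)                            # shortest period, in samples
    q_max = int(fs / f0_min)                            # longest period, in samples
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return fs / peak


# Synthetic test signal (an assumption of this example): impulse train at 125 Hz.
fs = 16000
x = np.zeros(2048)
x[::128] = 1.0
print(f0_cepstrum(x, fs, f0_min=80.0))
```

The low part of the cepstrum, below `q_min`, mostly describes the vocal tract envelope, which is why the search for the source peak starts after it.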


Indirect F0 measurements: glottal flow measurements

Remove the vocal tract contribution from the acoustic speech signal:

• Source-filter representation of the speech production system

• Inverse filtering

• Sinusoid decomposition

Isolate the voiced part (purely harmonic part) of the speech signal

LPC analysis

Iterative procedure: Iterative Adaptive Inverse Filtering

[Diagram: the vocal cords (glottal flow Ug) and a noise source feed the filter (= vocal tract), whose output is the radiated pressure signal picked up at the lips]
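A single LPC inverse-filtering step can be sketched as below. This is not the full iterative adaptive procedure named on the slide, only its basic building block under simple assumptions: the vocal tract is modelled as an all-pole filter estimated with the autocorrelation method, and the function names are hypothetical.

```python
import numpy as np


def lpc_coefficients(x, order):
    """Autocorrelation-method LPC.

    Returns a, with a[0] = 1, such that e(n) = sum_k a[k] * x(n - k)
    is the prediction residual.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    # Normal equations R p = r[1:], with R the Toeplitz autocorrelation matrix.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    predictor = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -predictor))


def inverse_filter(x, a):
    """Remove the estimated all-pole (vocal tract) contribution from x."""
    return np.convolve(x, a)[:len(x)]
```

Applied to a voiced speech frame, `inverse_filter(x, lpc_coefficients(x, order))` gives a residual that approximates the excitation; obtaining a glottal flow estimate requires further steps, such as the iterative refinement and integration, which are not shown here.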


Obtained signal for glottal flow

[Figure: two panels over the interval 0.24 s to 0.30 s: signal of radiated pressure at the lips, and glottal flow derivative signal]


Glottal flow characterization

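[Figure: glottal flow (in cm3/s) and its derivative (in cm3/s2) between 0.09 s and 0.1 s, annotated with U0, T0 = 1/F0, Tc, Ei and Ee]

Parameters:

- Amplitude U0
- Period T0 = 1/F0
- Closure time Tc
- Energies Ei and Ee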


Application to Vietnamese tone analysis

Nguyen Quoc Cuong PhD work

Tone    Description               Vietnamese name (tiếng Việt)   Mark
tone1   flat (or level) tone      không dấu (bằng or ngang)      ( )
tone2   decreasing tone           huyền                          (`)
tone3   broken tone               ngã                            (~)
tone4   interrogative tone        hỏi                            (?)
tone5   increasing (acute) tone   sắc                            (´)
tone6   low tone                  nặng                           (.)

Register   Plain        Melodic         Glottal
high       flat         increasing      broken
low        decreasing   interrogative   low

Number of tones to be characterized: 8

- tone1 (flat, không dấu), tone2 (decreasing, huyền), tone3 (broken, ngã), tone4 (interrogative, hỏi)
- tone5a and tone6a: tone5 (increasing, sắc) and tone6 (low, nặng) for open syllables
- tone5b and tone6b: tone5 and tone6 for closed syllables

[Figure labels: measurement points; two registers]


Vietnamese tone standard shapes – for one woman

[Figure: panels Ton1, Ton2, Ton3, Ton4 - North, Ton4 - South & Centre, Ton5a, Ton5b, Ton6a, Ton6b, each showing the maximal and minimal values]



Vietnamese tone standard shapes – for a man

[Figure: panels Ton1, Ton2, Ton3, Ton4 - North, Ton4 - South & Centre, Ton5a, Ton5b, Ton6a, Ton6b]



Vietnamese tone durations

Comparison of relative durations:

$X = \frac{Y}{N}$

X – relative duration for tone i

Y – mean duration of tone i for the speaker

N – mean duration over the six tones for the same speaker
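The relative duration is a simple ratio, which the sketch below computes for one speaker. The function name and the duration values are hypothetical; N is taken here as the mean over whichever tones are supplied.

```python
def relative_durations(mean_duration_by_tone):
    """Return X = Y / N for each tone of one speaker.

    Y is the speaker's mean duration for the tone and N is the mean of
    these durations over all the tones supplied.
    """
    values = list(mean_duration_by_tone.values())
    n = sum(values) / len(values)
    return {tone: y / n for tone, y in mean_duration_by_tone.items()}


# Hypothetical mean durations in seconds, for illustration only.
durations = {"ton1": 0.30, "ton2": 0.30, "ton3": 0.20, "ton4": 0.40}
print(relative_durations(durations))
```

A value above 1 means the tone is longer than that speaker's average, which allows speakers with different speaking rates to be compared on the same scale.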

[Figure: relative duration diagram for the subjects from the North. Relative duration (0 to 1.6) for ton1, ton2, ton3, ton4, ton5a, ton5b, ton6a, ton6b; subjects PNY, VTT, DPQ, DHH, DHL, BXH, TTA.]

[Figure: relative duration diagram for the subjects from the South and the Centre. Relative duration (0 to 1.6) for ton1, ton2, ton3, ton4, ton5a, ton5b, ton6a, ton6b; subjects NTH, VTH, BKH, LPL, LVS, TVH, HBQ, TTT.]

Ton6a differs between the North and the South and Centre.



Some special cases for tone 6a

Usually all tone6a are decreasing and short.

But we can find some specific cases of tone6a for the South and the Centre.

[Figure: an example of a "normal" tone6a]

[Figure: an example of tone6a from subject NTH, with the syllable "cạnh"]

[Figure: an example of tone6a from subject TTT, with the syllable "bịa"]

[Figure: an example of tone6a from subject BKH, with the syllable "cộng"]



Multimedia demonstration

Spectral Analysis

FFT, spectra

Spectrogram

Sensimetrics software


