
7/21/2017

EE679: Speech Processing

A preview

Dept of Electrical Engineering, I.I.T. Bombay

Why do we need a special course for signal processing of speech?

“Signal processing” is concerned with the mathematical representation of the signal and the algorithmic operations carried out to modify the signal or to extract information from it.

The representation and the algorithms are application domain specific, i.e. there are no “generic” methods.

An understanding of the signal and of the application is crucial to the success of the signal processing methods.


Human communication

• Vocal, visual, gestural

• Language is used for communication and is independent of the modality (writing, signing, speaking)

• Speech Communication is the transfer of information from one person to another via speech


Understanding speech communication


Acoustic waves

Speed = wavelength x frequency


[Figure: air pressure variation vs. time for two tones. Low pitch tone: T0 = 10 msec, frequency (Fo) = 1/To = 100 Hz. High pitch tone: T0 = 3.3 msec, frequency = 300 Hz.]

1 Hertz = 1 vibration/sec
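The relations on these two slides (frequency = 1/period, speed = wavelength x frequency) can be checked numerically. The following is a minimal sketch; the function names are illustrative, and the speed of sound of 340 m/s is an assumed value for air.

```python
# Assumed speed of sound in air (m/s); the exact value depends on temperature.
SPEED_OF_SOUND_M_PER_S = 340.0


def frequency_from_period(period_s):
    """Frequency in Hz of a periodic tone with the given period in seconds."""
    return 1.0 / period_s


def wavelength_m(frequency_hz, speed=SPEED_OF_SOUND_M_PER_S):
    """Wavelength in metres, from speed = wavelength x frequency."""
    return speed / frequency_hz


low_f0 = frequency_from_period(10e-3)    # low pitch tone, T0 = 10 msec
high_f0 = frequency_from_period(3.3e-3)  # high pitch tone, T0 = 3.3 msec

print(low_f0, high_f0)
print(wavelength_m(low_f0), wavelength_m(high_f0))
```

Under the assumed speed, the 100 Hz tone has a wavelength of a few metres, which is much longer than the vocal tract itself.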


Speech “waveform”


“Information” in speech?

• Linguistic (message -> sentences -> words -> phonemes)

The speech signal is characterised by an enormous range of elementary perceptually contrasting sounds!

• Paralinguistic:
    -- expressive (emotions, mood)
    -- speaker-based (age, gender, accent and style)


“Everyday” speech technology

• Mobile telephony (speech compression)

• Human-computer interfaces (speech recognition/synthesis)

• Security (speaker identification in biometrics, forensics)

• Speech enhancement (improving intelligibility or quality)

• Behavioural analytics


Generating speech*

Respiration->phonation->articulation

Vibrating vocal cords create puffs of air giving rise to air pressure variations which reach our ears.

*HyperPhysics, Sound and Hearing, Georgia State University


$f_1 = \dfrac{c}{4L};\quad f_2 = \dfrac{3c}{4L};\quad f_3 = \dfrac{5c}{4L};\ \ldots$

Vocal tract: Acoustic resonances*

*HyperPhysics, Sound and Hearing, Georgia State University

(http://hyperphysics.phy-astr.gsu.edu/hbase/sound/)


[Figure: schematic of the vocal tract. Labels: trachea connection to lungs; vocal cords; pharyngeal cavity; velum; nasal cavity; oral cavity; oral sound output; nasal sound output. Articulators (tongue, jaw, lips, teeth, velum): moving muscles which alter the resonant cavities. Also marked: static cavity, dynamic cavity, vocal cavity.]

Articulation: producing the various sounds of speech*

*Securivox tutorial


• The sound spectrum is modified by the shape of the vocal tract.

• The resonant frequencies of the vocal tract cause peaks in the spectrum called formants.

Vocal tract “filter”*

*Childers, Speech Overview


Von Kempelen's talking machine

1791

"Briefly, the device was operated in the following manner. The right arm rested on the main bellows and


1875

• Alexander Bell invents the method of, and apparatus for, “transmitting vocal or other sounds telegraphically ... by causing electrical undulations, similar in form to the vibrations of the air accompanying the said vocal or other sound”.

=> Major impetus to modern speech processing.

• 1930s: Electrical synthesis of speech by Dudley’s vocoder


Sound -> electrical form*

*The Physics Classroom: http://www.glenbrook.k12.il.us/gbssci/phys/Class/sound/u11l2a.html


Speech Waveforms from “my speech”

(a) start of “y” vowel

(b) “ee” vowel

(c) “s” consonant


Components of sound

A sound is usually composed of several frequency components.

Depending on the relationships of the frequency components, the sound can elicit a sensation of pitch.


[Figure: waveforms of tones at 300 Hz, 600 Hz and 900 Hz, and of the sums 300 Hz + 600 Hz and 300 Hz + 600 Hz + 900 Hz.]
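The effect of adding harmonically related components can be sketched in a few lines. This illustration assumes equal amplitudes and zero phase for each component (the slide does not specify them) and checks that the sum of 300, 600 and 900 Hz tones still repeats with the period of the 300 Hz component, which is why such a sound can elicit a pitch sensation at 300 Hz.

```python
import math


def harmonic_sum(t, f0=300.0, n_harmonics=3):
    """Sum of equal-amplitude sinusoids at f0, 2*f0, ..., n_harmonics*f0 at time t (seconds)."""
    return sum(math.sin(2 * math.pi * k * f0 * t) for k in range(1, n_harmonics + 1))


period = 1.0 / 300.0  # period of the lowest (300 Hz) component
times = [i * period / 50 for i in range(50)]

one_cycle = [harmonic_sum(t) for t in times]
next_cycle = [harmonic_sum(t + period) for t in times]

# Largest difference between the waveform and its copy shifted by one period.
max_diff = max(abs(a - b) for a, b in zip(one_cycle, next_cycle))
print(max_diff)
```

The shifted and unshifted cycles agree up to floating-point rounding, so the composite waveform is periodic at 1/300 sec even though it is no longer sinusoidal.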


Classification of speech sounds

Vowels and Consonants

• Vowels: steady sounds specified by position of the articulators (typically, tongue)

• Consonants: (dynamic) sounds classified by place and manner of articulation


Place of articulation (constriction of vocal tract)


Basic sounds of speech: Phones

• The speech signal can be divided into sound segments with fixed articulation and acoustics over short intervals, i.e. articulatory configuration <=> acoustic properties

Smallest meaningful sound unit: “phone” (i.e. set of distinctive sounds of a language)

In Indian written scripts, one symbol represents one phone.


PRAAT examples


Physiology (articulator motion)

Sound with specific acoustic characteristics (seen in waveform and spectrum)

Perception of certain sound qualities


Speech production basics

• Vocal cords (larynx) modulate the airflow from the lungs by rapid opening-closing; the rate of vibration is determined by their mass and tension. Pitch frequency ranges: male: 80-160 Hz; female: 160-320 Hz; singers: over 2 octaves.

• Vocal tract shapes the vocal cord vibrations into the intricate sounds of speech via changes in shape to produce various acoustic resonances.


• Glottal folds in action…


The interdisciplinary nature… *


* Fant, G. (1990). Speech research in perspective. Speech Communication.


Outline

• Speech production (physiology)

• Classification of sounds: articulatory, acoustic

• Speech analysis (signal processing methods for information extraction)

• Hearing, and speech perception

• Speech technology (compression, ASR, TTS, …)

• Audio/music technology


Text / References

• Douglas O'Shaughnessy, Speech Communications: Human and Machine, Universities Press (India) Ltd., 2001

• Rabiner and Schafer, Digital Processing of Speech Signals

• IITB Moodle for all course-related hand-outs


Evaluation

• Computing assignments (Python or Scilab) (30%)

• Exams: mid semester + end semester (70%)

• Attendance is compulsory (<80% => XX, even before midsem)


Speech Production

Utterance: "Should we chase"

Acoustic waveform

Production of speech:

Glottal source

Wednesday, July 27, 2011 6:18 AM

Class-SP-1.4-print1 Page 1


• Respiration <= Lungs

• Phonation <= Vocal cords

• Articulation <= Vocal tract

• Respiration: the air flow for speech production (lungs).

• Phonation: generation of basic sound by vibration of vocal cords (glottis). The otherwise smooth airflow is disturbed, causing sound.

• Articulation: changing the spectrum of sound (vocal tract). It gives rise to different types of sound. The variation is generated by adjusting the nature and shape of the mouth cavity.

Respiration

A simple but important part of speech production. Respiration provides the air-flow and pressure source required for speech production. The lungs primarily serve breathing: inspiration, expiration.

• Most languages' sounds are formed during expiration (“egressive” sounds).

Total lung capacity is 4-5 litres. The volume velocity of air leaving the lungs is about 0.2 litre/sec during sustained sounds.

• Increased air-flow rate => increase in sound amplitude


Vocal folds: anatomy and physiology

Pair of elastic structures of tendon, muscles and mucous membrane situated in the larynx. The variable opening between the folds is the “glottis”. In normal breathing, the cords are parted to allow free passage of air.

Observing vocal fold motion:

• electro-glottography

• video photography (see track9)

The vocal cords function chiefly in two modes:

1. With phonation: opening-closing periodic motion => periodic waveform

2. Without phonation: vocal folds are kept slightly parted => aperiodic (noisy) waveform

Phonation (vocal cord vibration) is an involuntary muscle action. It occurs when

(a) the vocal cords are elastic and close together, and

(b) there is sufficient difference between sub-glottal and supra-glottal pressure

Anatomical views of Larynx and vocal folds <www.mayoclinic.com>

Phonation

Glottis


The aerodynamics…..

Electro-glottograph (EGG)

Impedance is monitored via high-frequency current between electrodes across the throat.

EGG is based on the principle that tissue is a moderate conductor whereas air is a poor one. A high-frequency current is passed between electrodes positioned on either side of the thyroid cartilage and the electrical impedance is monitored => area of opening vs time.

Show EGG waveform (correlate of glottal opening).

But more typically, we show glottal volume velocity (cc/sec vs time). This is not directly obtained from the glottal opening due to source-tract interaction (loading) effects. A Rothenberg flow mask is used to measure flow at the mouth opening, and then the formants are removed by inverse filtering.


Glottal pulses are not truly periodic but exhibit jitter and shimmer due to neurologic, biomechanical and aerodynamic disturbances.

Jitter: period to period variations in duration; normally < 1%

Shimmer: period to period variations in amplitude; normally < 6%

Not normally directly perceptible but add to naturalness of the voice.

High jitter-shimmer => roughness

Glottal flow signal can be approximated by 2 poles near dc. K. N. Stevens, “On the quantal nature of speech,” J. Phonet., 17, 3–46 (1989).

Voice quality is altered by modifying the glottal vibration pattern. Voice quality changes can be non-phonemic or phonemic.
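The jitter and shimmer measures described above can be sketched as follows. The notes do not give a formula, so this sketch assumes one common definition: the mean absolute difference between consecutive cycles, expressed as a percentage of the mean value. The cycle measurements are hypothetical numbers chosen only for illustration.

```python
def mean_relative_variation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean value.

    Assumed definition; other jitter/shimmer variants exist.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


# Hypothetical measurements for five consecutive glottal cycles.
periods_ms = [10.00, 10.05, 9.98, 10.02, 9.97]
peak_amplitudes = [1.00, 0.97, 1.02, 0.99, 1.01]

jitter_percent = mean_relative_variation(periods_ms)
shimmer_percent = mean_relative_variation(peak_amplitudes)
print(jitter_percent, shimmer_percent)
```

For these made-up cycles both values fall below the normal limits quoted in the notes (1% for jitter, 6% for shimmer).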

Rate of Vibration of the vocal cords

The average rate is inversely proportional to the length of the vocal folds. This length is correlated with neck circumference.

Voluntary control: By means of muscle contractions, the vocal folds can be varied in length (tension), thickness and position configuration.

Folds are relaxed (short) and thick -> low pitch

Folds are tense (long) and thin -> high pitch

Male: 80 - 160 Hz

Female: 160 - 320 Hz


Types of Phonation: non-phonemic; speaker-dependent or controlled

• Normal: or modal quality; can change with changing speed of glottal closure

• Breathy / Whisper: incomplete closure with posterior portion of the glottis always open; the airflow has periodic + noisy component; extent of breathiness depends on proportion of time vocal folds are open.

• Creaky / Hoarse: folds are closed with a small part vibrating with irregular period.

• Falsetto: folds are thin and don't close completely; only central part vibrates with high rate.

Pathological voices are rough or hoarse, and are quantified by measures of aperiodicity including breath noise.


"Phonemic" voice quality

We can divide all speech sounds, based on whether they are produced with vocal fold vibration or without (held open with narrow constriction), into the categories:

• Voiced sounds

• Unvoiced sounds

            Vowels      Fricatives   Plosives
Voiced      normal      z, j, v      b, d, g
Unvoiced    whispered   s, sh, f     p, t, k

Other source of sound in glottis: Aspiration noise

Electronic Larynx


Articulation

The sound produced at the larynx passes through the vocal tract which alters the sound quality based on the selected positions of the articulators (tongue, jaw, lips, velum) changing the shape of the vocal tract "resonator".

From unsw acoustics site.

We can use the known expressions for resonances of a tube of given length and end (open/closed) conditions.

(These known expressions come from solving Newton's 2nd law for sound propagation in the body to arrive at the constant of proportionality in the Simple Harmonic Motion differential eqn).

From: Ladefoged, Acoustic Phonetics

Tube model for vocal tract:

Good approximation for the sound /uh/ as in "burn"

Vocal tract acoustics

To appreciate the role of the vocal tract, change your mouth shape while phonating at constant pitch and amplitude.

We can now see how we can independently control the larynx (source) and vocal tract articulators (filter) for different sounds.

Vocal tract

Monday, August 20, 2012 1:25 PM


For L = 17.5 cm, c = 340 m/s => f ≈ 500, 1500, 2500, ... Hz

Tube approximation for /a/ as in "cart"

For L1 = L2 = 8.75 cm => f = 1000, 3000, 5000… Hz

Other vowels; Role of tongue, lips. Tongue position and height create the vocal tract cavities. Rounding of lips changes length.

Nasal sounds: Branched resonator

In reality, there are perturbations in above values due to the coupling between the tubes. E.g. /a/ tubes' resonances at 1000 are really at 900, 1100 Hz.
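The tube-model figures above can be reproduced with a short calculation. This is a sketch assuming an ideal lossless uniform tube closed at one end (glottis) and open at the other, whose resonances are odd multiples of c/(4L); the function name is illustrative and coupling between the tubes is ignored.

```python
def closed_open_tube_resonances(length_m, speed=340.0, count=3):
    """Resonances (Hz) of a uniform closed-open tube: odd multiples of c / (4 L)."""
    return [(2 * n - 1) * speed / (4.0 * length_m) for n in range(1, count + 1)]


# Single 17.5 cm tube (neutral vowel /uh/).
neutral_vowel = closed_open_tube_resonances(0.175)

# Each 8.75 cm tube of the two-tube approximation for /a/, treated in isolation.
half_tube = closed_open_tube_resonances(0.0875)

print([round(f) for f in neutral_vowel])
print([round(f) for f in half_tube])
```

With c = 340 m/s the computed values come out slightly below the rounded 500, 1500, 2500 Hz and 1000, 3000, 5000 Hz quoted in the notes; the round figures correspond to a speed of sound of about 350 m/s.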


Damped resonator: spectrum, waveform

Nasal consonants: Closure of oral cavity + radiation of sound through nasal cavity.

Oral cavity acts as a side-branch resonator, introducing zeros (anti-resonances) based on its length.

Nasalised vowels: Both oral and nasal cavities are open and coupled, but the oral cavity is more open. Thus the nasal cavity acts like an anti-resonator.

Laterals, fricatives

Laterals (l, r) have a side-cavity that introduces anti-resonances.

<- pocket of air above tongue

<- main cavity curves around tongue

Unvoiced consonants: There is a turbulent flow of air through a constriction within the vocal tract. This constriction creates a frication noise source that excites primarily the portion of the vocal tract in front of it. Depending on the place of the constriction we have different sounds: sh, s, f.

Effect of losses in the vocal tract:

In the ideal lossless model, resonances and anti-resonances have zero bandwidth. But in practice, there are losses in the speech production system such as:

• yielding (not rigid) walls that vibrate at low frequencies,

• viscous friction between the air and walls and heat conduction through walls,

• large yielding surface area of nasal cavity,

• sound radiation at the lips.


Also applies to musical instruments...

Lip radiation: The lips form a small opening so that diffraction (bending) of large wavelengths (low frequencies) takes place while high frequencies are directed in front => lip radiation is modeled by a high-pass filter.
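One simple way to illustrate a high-pass lip radiation model is a first-difference filter, R(z) = 1 - z^-1. This particular filter is an assumption here (the notes only say "high-pass filter"), as are the function names and the 8000 Hz sampling rate.

```python
import cmath
import math


def lip_radiation_gain(freq_hz, sample_rate_hz):
    """Magnitude of the assumed first-difference filter R(z) = 1 - z^-1 at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    return abs(1.0 - z_inv)


def first_difference(samples):
    """Apply y[n] = x[n] - x[n-1], taking x[-1] = 0."""
    previous = 0.0
    out = []
    for x in samples:
        out.append(x - previous)
        previous = x
    return out


fs = 8000.0  # assumed sampling rate in Hz
for f in (100.0, 1000.0, 3000.0):
    print(f, lip_radiation_gain(f, fs))
```

The gain is zero at dc and rises with frequency, which matches the qualitative high-pass behaviour described for the lips.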


$B = -\sigma/\pi$

$\omega = 2\pi F = 2\pi(1/T)$

Source-filter model of speech production

For given formant frequency Fi Hz and bandwidth Bi Hz , we have for sampling period T:

$\theta_i = 2\pi F_i T$

$r_i = e^{-\pi B_i T}$

Digital resonator
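A minimal sketch of a second-order digital resonator built from these two expressions, with a conjugate pole pair at radius r_i and angle ±θ_i, so that H(z) = 1 / (1 - 2 r_i cos θ_i z^-1 + r_i^2 z^-2). The 500 Hz formant, 60 Hz bandwidth, 8000 Hz sampling rate and function names are assumed values for illustration only.

```python
import cmath
import math


def resonator_coefficients(formant_hz, bandwidth_hz, sample_rate_hz):
    """Return (a1, a2) for H(z) = 1 / (1 + a1 z^-1 + a2 z^-2)."""
    T = 1.0 / sample_rate_hz                    # sampling period
    theta = 2.0 * math.pi * formant_hz * T      # pole angle
    r = math.exp(-math.pi * bandwidth_hz * T)   # pole radius
    return -2.0 * r * math.cos(theta), r * r


def magnitude_response(a1, a2, freq_hz, sample_rate_hz):
    """|H| evaluated on the unit circle at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    return 1.0 / abs(1.0 + a1 * z_inv + a2 * z_inv * z_inv)


fs = 8000.0
a1, a2 = resonator_coefficients(500.0, 60.0, fs)

# Search a 10 Hz grid below the Nyquist frequency for the response peak.
grid = [10.0 * k for k in range(1, 400)]
peak_hz = max(grid, key=lambda f: magnitude_response(a1, a2, f, fs))
print(a1, a2, peak_hz)
```

The response peaks close to the requested formant frequency, and since r_i < 1 for any positive bandwidth the poles stay inside the unit circle. Cascading one such section per formant is one way to sketch the vocal tract filter of the source-filter model.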

For consonant phones:


Acoustic phonetics: the differentiation of sounds on an acoustic basis. The acoustics are more evident spectrally rather than in the time domain.

<---- Voicing and manner
