
Audio & Speech Technology for Consumer Electronics: Basics and Technical Challenges

ICCE Consumer Electronics Society Webinar

Reinhard MOELLER, University of Wuppertal

IEEE Consumer Electronics Society

21.09.17

● Introduction
● Historical Facts
● Mathematical Elements of Speech Technology
● Speech Processing



● Introduction



Introduction

● Humans differentiate between sound and noise

● Sound and noise are the evolutionary basis of communication between humans and their environment

● Humans can feel and hear acoustic information



Principles of Sound

Sound

• travels in waves, produced when an object pushes on the air around it, causing small changes in air pressure.

• Properties: frequency, wavelength, period, amplitude, phase and speed

• Can be one single tone or a mixture of several tones with equal or different properties. Examples:

  • Music: a mixture of different frequencies and amplitudes
  • White noise: mix of frequencies with equal power distribution over a given frequency range; “unwanted” sound, harsh/crisp sounding noise
  • Pink noise: mix of frequencies with equal power distribution over a given logarithmic frequency scale; “naturally” sounding environment noise
  • Speech



Human Audio “Sensors”: The Ears

The principle of hearing (after H. v. Helmholtz, 1873)

The inner ear is an active sound analyzer



Measurement of Sound

• The sound level heard by human ears is commonly measured in decibels (dB)

• Referring to sound, the decibel expresses the level of a sound wave relative to a reference: 10 log10(P2/P1) dB for powers P1 (reference) and P2, which equals 20 log10 of the ratio of the corresponding sound pressure amplitudes

• The decibel is useful because it can represent the wide range of sound levels the human ear can hear on a more manageable scale

• On the decibel scale, the softest sound that can be heard is 0 dB (P2 = P1). Every increase of 10 dB represents an approximate doubling of the perceived loudness of the sound
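
The decibel formula can be tried out with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the factor 20 for amplitudes follows from power being proportional to the square of the sound pressure amplitude.

```python
import math


def power_ratio_to_db(p2: float, p1: float) -> float:
    """Level of power p2 relative to the reference power p1, in dB."""
    return 10.0 * math.log10(p2 / p1)


def amplitude_ratio_to_db(a2: float, a1: float) -> float:
    """Level of sound pressure amplitude a2 relative to a1, in dB.

    Power grows with the square of the amplitude, hence 20 instead of 10.
    """
    return 20.0 * math.log10(a2 / a1)


if __name__ == "__main__":
    # Equal powers (P2 = P1), a tenfold power, and a tenfold amplitude.
    print(power_ratio_to_db(1.0, 1.0))
    print(power_ratio_to_db(10.0, 1.0))
    print(amplitude_ratio_to_db(10.0, 1.0))
```

Read as an amplitude ratio, a range of 1,000,000:1 corresponds to about 120 dB, which shows how the logarithm compresses a very wide range into a manageable one.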



Dynamics of Human Hearing

Scale: from very soft to extremely loud

Dynamic range of a bicycle: 7:1

Dynamic range of a human ear: 1,000,000:1

Issues:
● How loud is too loud?
● What about hearing impairment?



Human Audio “Actuator”: Speech and Tone

● Speech production model: source-filter interaction
– Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

Glottal pulses → Vocal tract → Speech signal



● Historical Facts



Pre-History of Audio and Speech Technology

● 1653: Cyrano de Bergerac, „Sonderbare Geschichten der Staaten und Reiche des Mondes“

– .. books are little mechanical boxes like wristwatches .. reader fits its nerves and listens to the sound …

● 1786: Baron Münchhausen, „Der Ritt auf der Kanonenkugel und andere Abenteuer“

– .. frozen sound carried in a post horn, melted behind a warm oven and resound ..




● 1634: Kepler
– „once we will produce speaking machines, but they will have a snarling tone“

● 1761: Euler
– „It would be one of our most important inventions, if we could build a machine able to imitate all sounds of our words with all articulations... The thing does not seem to be impossible to me“

● 1773: Ch. G. Kratzenstein
– Single vowels using resonance tubes connected to organ pipes




● 1791: Wolfgang von Kempelen
– „Mechanism of the human speech and description of a speaking machine“
– The Chess Turk
– detailed construction plans, basis for later reconstructions and improvement
– called „..the first phonetician..“

● 1824: Johann Maelzel
– speaking doll (Mama, Papa)

Figure: Kempelen's Speaking Machine (Source: German Museum)



History of consumer audio recording

● 1877: Edison's Phonograph
– Information carrier is a cylinder
– Intended applications:
  ● dictaphone, voice recorder
  ● Archive of voices of famous people
– First recorded and replayed word: HELLO



History of consumer audio recording

● 1887: Berliner's Gramophone started the success story of music mass reproduction (wax-coated zinc plate)
● 1892: pressed rubber disc
● 1895: shellac disc
● 1896: Edison spring motor enhanced phonograph
● 1908: double-sided disc
● 1948: PVC



History of consumer audio recording

1898: Piano Roll in mass production



History of consumer audio electronics recording: Music media

● 1930's: magnetizable tapes

● 1983: Digital Audio Tape (DAT)
– originally for consumer use
– professional 8 channel S-VHS since 1993

● 1980: Red Book Standard (Audio CD)
– 44.1 kHz, 16 bit, 74 minutes

● 1990+: DVD Audio, Mini Disc, iPod, solid state disc & more
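
The Red Book figures above translate directly into a data rate and a storage requirement. The sketch below assumes two channels (stereo), which the slide does not state, and counts raw PCM audio only, without error correction or subcode overhead; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100   # from the slide
BITS_PER_SAMPLE = 16      # from the slide
PLAY_TIME_MIN = 74        # from the slide
CHANNELS = 2              # assumption: stereo, not stated on the slide


def pcm_bit_rate(sample_rate_hz: int, bits: int, channels: int) -> int:
    """Raw PCM bit rate in bit/s."""
    return sample_rate_hz * bits * channels


def pcm_size_bytes(bit_rate: int, minutes: int) -> int:
    """Raw PCM payload in bytes for the given playing time."""
    return bit_rate * minutes * 60 // 8


if __name__ == "__main__":
    rate = pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
    size = pcm_size_bytes(rate, PLAY_TIME_MIN)
    print(f"bit rate: {rate / 1000:.1f} kbit/s")
    print(f"audio payload: {size / 1e6:.1f} MB")
```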



Consumer audio electronics: Development towards spatial audio

● 2 channel stereo: one-dimensional (width of stage)

● 2 channel surround: two dimensions (added depth of room)

● N channel 3D: added audio tracks for upper frequency bands

● N-channel object-based VR: binaural technology, outside head

● Future: Audio AR, e.g. for gaming and navigation

Figure: increasing immersion from Stereo (2-3 speakers) through Surround (5 to 7 speakers) and 3D (7 plus speakers) to Audio VR (7 plus speakers); timeline marks: 60's, 70's, ~2016



Mathematical Elements of Speech Technology



HMI: Dialog and Speech Understanding

“A symbolic description should be calculated from a speech signal, that allows a usable reaction of a system to a verbally expressed user demand in context of a human-machine dialog.”

according to: Sagerer, Automatisches Verstehen gesprochener Sprache, BI-Wiss.-Verl., 1990



Mathematical Elements

● Elements
– Signal, System, Frequency, Amplitude, Phase, Spectrum
– Sampling, Quantisation

● Acoustic Models of Speech Production
– Tube Model
– Source-Filter Model
– Perturbation Model (Formant Shifting)

● Spectral Attributes of Sound Classes

● Spectral Analysis
– Basics
– Windowing



Basics and Terminology

● Signal
– analog (continuous in time and value)
  • modulated signals: amplitude-modulated, frequency-modulated
– digital (discrete in time and value)

● Signal parameters
– Frequency
– Amplitude
– Phase

● Spectrum



Frequency, Amplitude, Phase

● Frequency = 1 / cycle time [Hz]

● Phase = displacement of a wave with respect to a fixed point in time

Figure: sine wave over time t with cycle time and amplitude marked; one pair of waves with the same phase and one pair of waves with different phase
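
The three signal parameters can be made concrete with a sampled sine wave. This is a minimal sketch; the function names and the example values (100 Hz, phase shift of π) are chosen for illustration only.

```python
import math


def frequency_from_cycle_time(cycle_time_s: float) -> float:
    """Frequency in Hz is the reciprocal of the cycle time in seconds."""
    return 1.0 / cycle_time_s


def sine_value(t_s: float, amplitude: float, frequency_hz: float,
               phase_rad: float = 0.0) -> float:
    """Value of a sine wave at time t_s; phase_rad shifts the wave in time."""
    return amplitude * math.sin(2.0 * math.pi * frequency_hz * t_s + phase_rad)


if __name__ == "__main__":
    f = frequency_from_cycle_time(0.01)  # 10 ms cycle time
    for i in range(5):
        t = i * 0.0025                   # quarter-cycle steps
        same = sine_value(t, 1.0, f)
        shifted = sine_value(t, 1.0, f, math.pi)
        print(f"t={t:.4f} s  phase 0: {same:+.3f}  phase pi: {shifted:+.3f}")
```

A phase difference of π turns every sample into its negative, so two such waves of equal amplitude cancel when added.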



Analog to Digital Signal Conversion

● Analog Signal

● Sampling
– Time becomes discrete

● Quantization
– Values become discrete



Sampling

● Nyquist/Shannon definition
– Signal is fully reconstructable if f_sample > 2 f_max
– Otherwise we get aliasing

● Example, speech analysis:
– f_max ~ 7 kHz
– f_sample = 16 kHz

● Sampling rate:
– Number of samples per second
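
Aliasing can be demonstrated with the numbers from the speech example. The sketch below uses hypothetical helper names and compares a 7 kHz tone, which satisfies the sampling condition at 16 kHz, with a 9 kHz tone, which does not. After sampling, the 9 kHz tone produces the same sample values as a 7 kHz tone except for the sign.

```python
import math


def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a real tone f_hz after sampling at fs_hz."""
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))


def sample_tone(f_hz: float, fs_hz: float, n_samples: int) -> list[float]:
    """Samples of a unit-amplitude sine of frequency f_hz taken at fs_hz."""
    return [math.sin(2.0 * math.pi * f_hz * n / fs_hz) for n in range(n_samples)]


if __name__ == "__main__":
    fs = 16_000
    for f in (7_000, 9_000):
        print(f"{f} Hz sampled at {fs} Hz appears as {alias_frequency(f, fs)} Hz")
    ok = sample_tone(7_000, fs, 8)
    aliased = sample_tone(9_000, fs, 8)
    # The aliased tone is indistinguishable from an inverted 7 kHz tone.
    print(all(abs(a + b) < 1e-9 for a, b in zip(ok, aliased)))
```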



Quantization

Figure: each sampling value is mapped to the mean value of its interval; the difference is the quantization error, which is bounded by the maximum quantization error
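
A uniform quantizer that maps each sampling value to the mean value of its interval can be sketched as follows. The input range of -1 to +1 and the function names are assumptions for illustration; with this scheme the maximum quantization error is half the interval width.

```python
import math


def quantize(x: float, bits: int, full_scale: float = 1.0) -> float:
    """Map x in [-full_scale, full_scale) to the midpoint of its interval."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels              # width of one interval
    index = math.floor((x + full_scale) / step)   # which interval x falls in
    index = min(max(index, 0), levels - 1)        # clip values outside the range
    return -full_scale + (index + 0.5) * step     # mean value of that interval


def max_quantization_error(bits: int, full_scale: float = 1.0) -> float:
    """Half the interval width, valid for inputs inside the range."""
    return full_scale / 2 ** bits


if __name__ == "__main__":
    for bits in (3, 8, 16):
        x = 0.3
        q = quantize(x, bits)
        print(f"{bits:2d} bit: {x} -> {q:.6f}, error {abs(q - x):.6f}, "
              f"bound {max_quantization_error(bits):.6f}")
```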



Topics of Speech Acoustics

● Concerned with signal processing and speech communication

● Topics:
– Speech production, vocal tract models

– Speech signal analysis

– Speech perception, speech intelligibility and quality

– Speech and sound coding

– Speech synthesis

– Noise suppression, robust speech signal processing

– Speech recognition

– Speaker recognition



Speech signal in time and frequency domain

The word „aua“ (German for "ouch") in time domain

The word „aua“ in frequency domain



Signal Spectrogram vs. Cascade Spectrogram



Spectrogram II

● Wide-band Spectrogram
– Shows formants (resonance functions of vocal tract) = characteristics of filter

● Narrow-band Spectrogram
– Shows harmonics = characteristics of source

● Synonym: Sonagram



„flat“ Spectrogram (Sonagram)

Figure: axes are time (horizontal) and frequency (vertical); amplitude shown by density



Acoustic Models of Speech Production

● Source/Filter Model

● Tube Model

● Perturbation Model (formant shifting)



1) Source/Filter Model

Source (stimulation) → Filter (sound forming) → Speech signal
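
The source/filter idea can be sketched with a pulse train as the source and a single two-pole resonator as the filter. Everything specific here is an assumption for illustration: the 100 Hz pulse rate, the single formant at 700 Hz with 100 Hz bandwidth, the 16 kHz sampling rate and all function names. A real vocal tract has several formants and a smoother glottal pulse shape.

```python
import cmath
import math


def glottal_pulse_train(f0_hz: float, fs_hz: int, n_samples: int) -> list[float]:
    """Source: one unit impulse per pitch period (idealized glottal pulses)."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def formant_resonator(x: list[float], formant_hz: float, bandwidth_hz: float,
                      fs_hz: int) -> list[float]:
    """Filter: two-pole resonator y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius sets the bandwidth
    theta = 2.0 * math.pi * formant_hz / fs_hz      # pole angle sets the frequency
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        y0 = sample + a1 * y1 + a2 * y2
        y.append(y0)
        y1, y2 = y0, y1
    return y


def dft_magnitude(x: list[float], f_hz: float, fs_hz: int) -> float:
    """Magnitude of the discrete Fourier sum of x evaluated at frequency f_hz."""
    return abs(sum(s * cmath.exp(-2j * math.pi * f_hz * n / fs_hz)
                   for n, s in enumerate(x)))


FS = 16_000
source = glottal_pulse_train(100.0, FS, 1600)        # 0.1 s of pulses
speech = formant_resonator(source, 700.0, 100.0, FS)

if __name__ == "__main__":
    for f in (700.0, 2000.0):
        print(f"{f:6.0f} Hz  source: {dft_magnitude(source, f, FS):8.2f}  "
              f"filtered: {dft_magnitude(speech, f, FS):8.2f}")
```

The source has the same strength at both harmonics, while the filtered signal is much stronger near the resonance. The harmonics come from the source and the spectral envelope from the filter.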



2) Tube Model

● Vocal tract modelled with tube elements of different diameters

Figure: approximation of the changing cross-section with piecewise homogeneous tubes (tube model), from glottis to lips



Simplified tube model

● Assumptions:

– The whole vocal tract is a homogeneous tube

– Diameter is much less than length

– Equal diameter over length

– Glottis = total reflector

– Lips = open end

● Result:
– resonant wave
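
Under these assumptions the tube is closed at one end and open at the other, so it resonates at odd multiples of a quarter wavelength: F_k = (2k - 1) c / (4 L). The sketch below evaluates this for a tube length of 17 cm and a speed of sound of 340 m/s; both values are illustrative assumptions, not figures from the slides.

```python
def tube_resonances(length_m: float, count: int = 3,
                    speed_of_sound_m_s: float = 340.0) -> list[float]:
    """Resonance frequencies in Hz of a uniform tube closed at one end.

    Closed end = glottis (total reflector), open end = lips.
    """
    return [(2 * k - 1) * speed_of_sound_m_s / (4.0 * length_m)
            for k in range(1, count + 1)]


if __name__ == "__main__":
    # Assumed vocal tract length of 0.17 m.
    for k, f in enumerate(tube_resonances(0.17), start=1):
        print(f"resonance {k}: {f:.0f} Hz")
```

The resonances are spaced evenly, which fits a neutral vowel. Changing the tube shape moves them, as described for the perturbation model.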



3) Formant shifting model

● Defined by local energy maxima in spectrum

● Center frequency is defined as formant frequency

● Independent of fundamental frequency

● Based on resonance characteristics (size and form) of articulation tract

● 1st and 2nd formant define vowels



Formant-Shifting (Perturbation Model)

● Raising (+) or lowering (-) of the first three formants by shifting the local constriction of the articulation tract



Sonagrams i, u, a



Speech Recognition



Interdisciplinarity of Speech Technology

Figure: Engineering / Computer Science, Computer Linguistics and Phonetics overlap in Natural Dialog, Speech Understanding and Text-to-Speech, e.g. systems for Consumer Electronics



Typical Tasks in Speech Recognition

Speech Signal →
● Speech Recognition → Words: “How are you?”
● Language Recognition → Language Name: English
● Speaker Recognition → Speaker Name: Glenn Miller

Goal: Automatically extract information transmitted in speech signal



Three Steps of Speech Processing

Each step is a reduction of uncertainty (acc. to W. Wahlster, DFKI):

1) Speech Recognition: What did the speaker say?
– Input: spoken input
– Knowledge sources: acoustic speech analysis, word lists
– Result: 100 alternatives

2) Speech Analysis: What does the speaker mean?
– Knowledge sources: grammar, word definitions
– Result: 10 alternatives

3) Speech Understanding: What is the intent of the speaker?
– Knowledge sources: knowledge about topic, dialog partner and context
– Result: unambiguous understanding within the dialog



Speech Recognition: Dependencies

● Environment: noise, acoustics, S/N ratio

● Speaker's state: health, stress, gender ..

● Speaker's literacy: language, amount of words

● Software: system, dynamics, algorithm, error handling

● Use case: translation, user-device dialog, robotics

● Hardware: microphones, speakers

● Dialog architecture: software design

● Training



Noise contamination of speech

Noise

● Environmental
– Continuous, e.g. air conditioner, motors, fans, continuous conversation
– Transient, e.g. phone, vocal/conversational, alarms

● Personal
– Related to breathing, e.g. body motion, respiratory infections, distorted respiration
– Not related to breathing, e.g. indoor/outdoor movement, clothes, joint crackles



Process Chain in Speech Processing

● Speech Recognition: Acoustic Wave; Possible Phonemes; Possible Words; Possible Sentences

● Speech Analysis: Possible Sentences; Grammar Structure; Word Meaning; Phrase/Sentence Meaning

● Speech Understanding and Translation: Sentence Meaning; Discourse Meaning in Source Language; Phrase Choice in Target Language

● Generation and Synthesis: Discourse Meaning in Target Language; Phrase Choice in Target Language; Sentence Production; Speech Synthesis; Prosody Generation



Remember: Technical Evaluation of a Speech Signal

● Speech is a continuous evolution of the vocal tract
– Need to extract time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift
– Each windowed segment → Fourier Transform → Magnitude

• Produces time-frequency evolution of the spectrum
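
The sliding-window analysis can be sketched in pure Python. The 20 ms window and 10 ms shift come from the slide. The 16 kHz sampling rate, the Hamming window, the direct (slow) Fourier sum and the function names are assumptions for illustration; a practical system would use an FFT library.

```python
import cmath
import math


def split_frames(signal: list[float], fs_hz: int, window_ms: float = 20.0,
                 shift_ms: float = 10.0) -> list[list[float]]:
    """Cut the signal into overlapping frames (sliding window)."""
    win = int(fs_hz * window_ms / 1000)
    hop = int(fs_hz * shift_ms / 1000)
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]


def magnitude_spectrum(frame: list[float]) -> list[float]:
    """Hamming-windowed Fourier transform magnitude for bins 0 .. N/2."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(signal: list[float], fs_hz: int) -> list[list[float]]:
    """Time series of magnitude spectra: one row per 10 ms shift."""
    return [magnitude_spectrum(f) for f in split_frames(signal, fs_hz)]


if __name__ == "__main__":
    fs = 16_000
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(800)]
    rows = spectrogram(tone, fs)
    for t, row in enumerate(rows):
        peak = max(range(len(row)), key=row.__getitem__)
        print(f"frame {t}: strongest bin {peak} = {peak * fs / 320:.0f} Hz")
```

With a 20 ms window at 16 kHz each frame holds 320 samples, so neighbouring frequency bins are 50 Hz apart.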



Sonagram

Figure: narrow-band sonagram and broad-band sonagram of the same utterance; axes are time (s) and frequency; voiced segments and formants are marked



Segmentability of Sonagrams: Phonemes



Speech Recognition: Problems

acc. to W. Wahlster, DFKI

● Spontaneous speech
● Nonlinear time distortion
● Channel distortion
● „Cocktail party effect“
● Co-articulation
● Variation in speech (slang)
● No break between words
● Punctuation? Capitalization?

Example, from „Calligraphy“ to what actually arrives at the recognizer:

● A very good morning Mrs. Lennard. How is the state of your actual workplan?
● Hi Jane, what's up with your plans?
● Hi Jane what's up with your plans
● HiJanewhatsupwithyourplans
● Uh Jaine, whatss up with ya plan



Speech Recognition: Variety of Signals

“Ich habe einen Termin um 17 Uhr 30” ("I have an appointment at 5:30 p.m.")



Speech Recognition: Word Hypothesis Graph

“It's hard to recognize speech”

U Washington, CS



Application to Consumer Electronics Dialog Systems

Figure: dialog systems plotted by systems complexity (low to very high) against size of vocabulary (small to very large). Labels: Standard IVR Systems, Command & Control, “Star Trek Dialogs”, Dictation, Telephone Dialogs, Dialog Systems



Characteristics of speech processing systems

Training efforts

● Speaker-dependent:
– high training efforts
– limited group of users
– highly individual and sensitive to small changes

● Speaker-independent:
– no training, robust
– small word capacity

● Speaker-adaptive:
– learning system
– instant improvement of recognition

Input types

• Single-word recognition:
– recognition of isolated spoken words

• Discrete recognition:
– short breaks between words

• Continuous recognition:
– no break between words

• Spontaneous recognition:
– speech with or without delays
– interrupted words



Questions?



Speaker Recognition



Features for Speaker Recognition

• Speech is a continuous evolution of the vocal tract
– Need to extract time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift
– Each windowed segment → Fourier Transform → Magnitude

• Produces time-frequency evolution of the spectrum



General Theory: Speaker Models

● Speaker models (voiceprints) represent the voice biometric in compact and generalizable form

Figure: example sound sequence h-a-d

• Modern speaker verification systems use Hidden Markov Models (HMMs)

– HMMs are statistical models of how a speaker produces sounds

– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.

– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.
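
How an HMM scores an observation sequence can be sketched with the forward algorithm on a toy model. Everything in this example is hypothetical: three left-to-right states loosely named after the sounds h, a, d, three discrete observation symbols, and hand-picked probabilities. Real systems use continuous spectral features and train the parameters with EM. A brute-force sum over all state paths is included only to check the forward recursion.

```python
from itertools import product

# Hypothetical toy model: states 0, 1, 2 stand for the sounds h, a, d.
START = [1.0, 0.0, 0.0]                  # always start in state "h"
TRANS = [[0.6, 0.4, 0.0],                # stay, or move on to the next sound
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
EMIT = [[0.8, 0.1, 0.1],                 # each state prefers "its" symbol
        [0.1, 0.8, 0.1],
        [0.1, 0.1, 0.8]]


def forward_likelihood(obs, start, trans, emit):
    """P(observations | model), computed with the forward recursion."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)


def brute_force_likelihood(obs, start, trans, emit):
    """Same quantity by summing over every state path (reference check only)."""
    total = 0.0
    for path in product(range(len(start)), repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


if __name__ == "__main__":
    matching = [0, 0, 1, 1, 2]           # symbols in the order h, a, d
    reversed_order = [2, 1, 1, 0, 0]     # same symbols, wrong temporal order
    print(forward_likelihood(matching, START, TRANS, EMIT))
    print(forward_likelihood(reversed_order, START, TRANS, EMIT))
```

The sequence that follows the modelled temporal order gets a much higher likelihood than the reversed one, which is the sense in which the states and transitions capture temporal changes of speech.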



Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications, they are applied where low-quality, noisy data and speaker independence must be handled. Such systems can achieve greater accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, and the results are generally better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.



Following: Part II

Applications



Psychoacoustics

University of Surrey, UK



Voiceprint as a Biometric

• Biometric: a human generated signal or attribute for authenticating a person’s identity

• Voice is a popular biometric:
– natural signal to produce
– ubiquitous: telephones, microphone equipped PC

• Voice biometric combined with other forms of security
– Something we have, e.g., badge
– Something we are, e.g., voice
– Something we know, e.g., password

Figure: the overlap of Have, Know and Are gives the strongest security
