Audio & Speech Technology for Consumer Electronics: Basics and Technical Challenges
ICCE Consumer Electronics Society Webinar
Reinhard MOELLER, University of Wuppertal
IEEE Consumer Electronics Society
21.09.17
● Introduction
● Historical Facts
● Mathematical Elements of Speech Technology
● Speech Processing
● Introduction
Introduction
● Humans differentiate between sound and noise
● Sound and noise are the evolutionary basis of communication between humans and their environment
● Humans can feel and hear acoustic information
Principles of Sound
Sound
• travels in waves, produced when an object pushes on the air around it, causing small changes in air pressure
• Properties: frequency, wavelength, period, amplitude, phase and speed
• Can be a single tone or a mixture of several tones with equal or different properties. Examples:
– Music: a mixture of different frequencies and amplitudes
– White noise: a mix of frequencies with equal power distribution over a given frequency range; "unwanted" sound, harsh/crisp sounding noise
– Pink noise: a mix of frequencies with equal power distribution over a given logarithmic frequency scale (equal power per octave); "naturally" sounding environment noise
– Speech
Human Audio “Sensors”: The Ears
The principle of hearing (after H. v. Helmholtz, 1873)
The inner ear is an active sound analyzer
Measurement of Sound
• The sound level heard by human ears is commonly measured in decibels (dB)
• For sound, the decibel expresses a level as a logarithmic ratio to a reference: 10 log10(P2/P1) dB for power (intensity), which is equivalent to 20 log10(p2/p1) dB for sound-pressure amplitude
• The decibel is useful because it represents the wide range of sound levels the human ear can hear on a more manageable scale
• On the decibel scale, the softest sound that can be heard is 0 dB (P2 = P1, the reference level). Every increase of 10 dB represents an approximate doubling of the perceived loudness of the sound
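A minimal sketch of the decibel formula in Python. The function names and the 20 µPa reference pressure (the conventional reference for 0 dB sound pressure level) are assumptions added for illustration; they do not appear on the slide.

```python
import math

# Assumed reference: 20 micropascal, the conventional 0 dB SPL reference.
P_REF_PA = 20e-6

def power_ratio_to_db(p2: float, p1: float) -> float:
    """Level difference in dB between two powers (or intensities)."""
    return 10.0 * math.log10(p2 / p1)

def pressure_ratio_to_db(a2: float, a1: float) -> float:
    """Level difference in dB between two sound-pressure amplitudes.

    Power is proportional to pressure squared, hence the factor 20.
    """
    return 20.0 * math.log10(a2 / a1)

def sound_pressure_level(pressure_pa: float) -> float:
    """Sound pressure level in dB relative to the assumed reference."""
    return pressure_ratio_to_db(pressure_pa, P_REF_PA)
```

Equal powers give 0 dB, a tenfold power ratio gives 10 dB, and a tenfold pressure ratio gives 20 dB.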
Dynamics of Human Hearing
[Figure: scale of sound levels from very soft to extremely loud]
Dynamic range of a bicycle: 7:1
Dynamic range of a human ear: 1,000,000:1
Issues:
● How loud is too loud?
● What about hearing impairment?
Human Audio “Actuator”: Speech and Tone
● Speech production model: source-filter interaction
– Anatomical structure (vocal tract/glottis) conveyed in speech spectrum
[Figure: Glottal pulses → Vocal tract → Speech signal]
● Historical Facts
Pre-History of Audio and Speech Technology
● 1653: Cyrano de Bergerac, “Sonderbare Geschichten der Staaten und Reiche des Mondes” (“Strange Stories of the States and Empires of the Moon”)
– .. books are little mechanical boxes like wristwatches .. the reader fits its nerves and listens to the sound …
● 1786: Baron Münchhausen, “Der Ritt auf der Kanonenkugel und andere Abenteuer” (“The Ride on the Cannonball and Other Adventures”)
– .. frozen sound carried in a post horn, melted behind a warm oven and resounded ..
Pre-History of Audio and Speech Technology
● 1634: Kepler
– “One day we will produce speaking machines, but they will have a snarling tone”
● 1761: Euler
– “It would be one of our most important inventions, if we could build a machine able to imitate all sounds of our words with all articulations ... The thing does not seem to be impossible to me”
● 1773: Ch. G. Kratzenstein
– Single vowels using resonance tubes connected to organ pipes
Pre-History of Audio and Speech Technology
● 1791: Wolfgang von Kempelen
– “Mechanism of the human speech and description of a speaking machine”
– The Chess Turk
– detailed construction plans, basis for later reconstructions and improvements
– called “.. the first phonetician ..”
● 1824: Johann Maelzel
– speaking doll (Mama, Papa)
[Figure: Kempelen's Speaking Machine. Source: German Museum]
History of consumer audio recording
● 1877: Edison's Phonograph
– Information carrier is a cylinder
– Intended applications:
● dictaphone, voice recorder
● archive of voices of famous people
– First recorded and replayed word: HELLO
History of consumer audio recording
● 1887: Berliner's Gramophone started the success story of music mass reproduction
– wax-coated zinc plate
● 1892: pressed rubber disc
● 1895: shellac disc
● 1896: Edison spring-motor enhanced phonograph
● 1908: double-sided disc
● 1948: PVC
History of consumer audio recording
1898: Piano Roll in mass production
History of consumer audio electronics recording: Music media
● 1930's: magnetizable tapes
● 1983: Digital Audio Tape (DAT)
– originally for consumer use
– professional 8-channel S-VHS since 1993
● 1980: Red Book Standard (Audio CD)
– 44.1 kHz, 16 bit, 74 minutes
● 1990+: DVD Audio, Mini Disc, iPod, solid state disc & more
Consumer audio electronics: Development towards spatial audio
● 2-channel stereo: one-dimensional (width of stage)
● 2-channel surround: two dimensions (added depth of room)
● N-channel 3D: added audio tracks for upper frequency bands
● N-channel object-based VR: binaural technology, outside head
● Future: Audio AR, e.g. for gaming and navigation
[Figure: increasing immersion over time (labels: 60's, 70's, ~2016): Stereo (2-3 speakers), Surround (5 to 7 speakers), 3D (7 plus speakers), Audio VR (7 plus speakers)]
Mathematical Elements of Speech Technology
HMI: Dialog and Speech Understanding
“A symbolic description should be calculated from a speech signal, that allows a usable reaction of a system to a verbally expressed user demand in context of a human-machine dialog.”
according to: Sagerer, Automatisches Verstehen gesprochener Sprache, BI-Wiss.-Verl., 1990
Mathematical Elements
● Elements
– Signal, System, Frequency, Amplitude, Phase, Spectrum
– Sampling, Quantisation
● Acoustic Models of Speech Production
– Tube Model
– Source-Filter Model
– Perturbation Model (Formant Shifting)
● Spectral Attributes of Sound Classes
● Spectral Analysis
– Basics
– Windowing
Basics and Terminology
● Signal
– analog (continuous in time and value)
• modulated signals: amplitude-modulated, frequency-modulated
– digital (discrete in time and value)
● Signal parameters
– Frequency
– Amplitude
– Phase
● Spectrum
Frequency, Amplitude, Phase
● Frequency = 1 / cycle time [Hz]
● Phase = displacement of a wave with respect to a fixed point in time
[Figure: sine wave over time t with cycle time and amplitude marked; two panels: waves with the same phase, waves with different phase]
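The three signal parameters can be illustrated with a sampled sine wave. This is a small sketch using NumPy; the function name `sine_wave` and the chosen values (100 Hz, 16 kHz sampling) are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def sine_wave(freq_hz, amplitude, phase_rad, duration_s, fs):
    """Return (t, x) with x = amplitude * sin(2*pi*freq*t + phase)."""
    t = np.arange(int(round(duration_s * fs))) / fs
    return t, amplitude * np.sin(2.0 * np.pi * freq_hz * t + phase_rad)

fs = 16_000
# One full cycle of a 100 Hz tone lasts 1/100 s = 0.01 s (the cycle time).
t, wave_a = sine_wave(100.0, 1.0, 0.0, 0.01, fs)
# Same frequency and amplitude, but shifted by half a cycle (phase = pi).
_, wave_b = sine_wave(100.0, 1.0, np.pi, 0.01, fs)
# Waves in opposite phase cancel when added.
residual = np.max(np.abs(wave_a + wave_b))
```

Two waves with the same phase would add up to twice the amplitude, while the half-cycle shift used here leaves a residual that is numerically zero.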
Analog to Digital Signal Conversion
● Analog signal
● Sampling
– Time becomes discrete
● Quantization
– Values become discrete
Sampling
● Nyquist/Shannon sampling theorem
– A signal is fully reconstructable if fsample > 2 fmax
– Otherwise we get aliasing
● Example, speech analysis:
– fmax ~ 7 kHz
– fsample = 16 kHz
● Sampling rate:
– Number of samples per second
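Aliasing can be made concrete with the slide's 16 kHz sampling rate. The sketch below (NumPy; the helper name `alias_frequency` and the 9 kHz test tone are assumptions for illustration) shows that a 9 kHz tone, which violates fsample > 2 fmax, produces exactly the same samples as a 7 kHz tone.

```python
import numpy as np

def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a real tone at f_hz after sampling at fs_hz."""
    f = f_hz % fs_hz
    return min(f, fs_hz - f)

fs = 16_000
n = np.arange(64)  # sample indices
tone_7k = np.cos(2.0 * np.pi * 7_000 * n / fs)  # below fs/2: represented correctly
tone_9k = np.cos(2.0 * np.pi * 9_000 * n / fs)  # above fs/2: folds back to 7 kHz
samples_identical = bool(np.allclose(tone_7k, tone_9k))
```

Once sampled, the two tones cannot be told apart, which is why the signal has to be band-limited before sampling.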
Quantization
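[Figure: each sampling value is replaced by the mean value of its quantization interval; the quantization error is the difference between the two, and the maximum quantization error is half the interval width]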
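A hedged sketch of a uniform quantizer that maps every sample to the mean value (midpoint) of its interval. The function name, the 3-bit resolution and the [-1, 1) input range are assumptions chosen to keep the example small.

```python
import numpy as np

def quantize(x, n_bits, full_scale=1.0):
    """Uniform quantizer over [-full_scale, full_scale).

    Returns the quantized samples (interval midpoints) and the interval width.
    """
    levels = 2 ** n_bits
    step = 2.0 * full_scale / levels
    index = np.floor((np.asarray(x, dtype=float) + full_scale) / step)
    index = np.clip(index, 0, levels - 1)      # keep out-of-range inputs in the end intervals
    return (index + 0.5) * step - full_scale, step

x = np.linspace(-1.0, 1.0, 10_001, endpoint=False)
xq, step = quantize(x, n_bits=3)
max_error = np.max(np.abs(x - xq))             # never exceeds step / 2
```

With 3 bits there are 8 intervals of width 0.25, so the maximum quantization error is 0.125.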
Topics of Speech Acoustics
● Concerned with signal processing and speech communication
● Topics:– Speech production, Vocal tract models
– Speech signal analysis
– Speech perception, intelligibility and quality
– Speech and sound coding
– Speech synthesis
– Noise suppression, robust Speech-signal processing
– Speech recognition
– Speaker recognition
Speech signal in time and frequency domain
[Figure: the word “aua” in the time domain]
[Figure: the word “aua” in the frequency domain]
Signal Spectrogram vs. Cascade Spectrogram
Spectrogram II
● Wide-band spectrogram
– Shows formants (resonance functions of vocal tract) = characteristics of filter
● Narrow-band spectrogram
– Shows harmonics = characteristics of source
● Synonym: Sonagram
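Why the two spectrogram types show different things follows from the analysis window length: the frequency resolution is roughly the inverse of the window duration. The sketch below is an order-of-magnitude illustration only; the function names and the 4 ms / 30 ms window lengths are assumptions and are not given on the slide.

```python
def analysis_bandwidth_hz(window_ms):
    """Rough frequency resolution of an analysis window: about 1 / duration."""
    return 1000.0 / window_ms

def resolves_harmonics(window_ms, f0_hz):
    """Harmonics are spaced f0 apart; they separate only if the bandwidth is below f0."""
    return analysis_bandwidth_hz(window_ms) < f0_hz

# Assumed example: a voice with a fundamental frequency of 120 Hz.
wide_band = resolves_harmonics(4.0, 120.0)     # short window, ~250 Hz bandwidth
narrow_band = resolves_harmonics(30.0, 120.0)  # long window, ~33 Hz bandwidth
```

A short window smears the individual harmonics into the broad formant envelope (filter characteristics), while a long window separates the harmonics of the source.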
“Flat” Spectrogram (Sonagram)
[Figure: sonagram with time on the horizontal axis and frequency on the vertical axis; amplitude shown by density]
Acoustic Models of Speech Production
● Source/Filter Model
● Tube Model
● Perturbation Model (formant shifting)
1) Source/Filter Model
[Figure: Source (stimulation) → Filter (sound forming) → Speech signal]
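The source/filter idea can be sketched in a few lines: an impulse train stands in for the glottal source and a cascade of two-pole resonators stands in for the vocal-tract filter. All names and numbers here (100 Hz pitch, resonances at 700 Hz and 1200 Hz with 80 Hz and 90 Hz bandwidths) are illustrative assumptions, not values from the slides, and the model is far simpler than a real vocal tract.

```python
import numpy as np

def glottal_pulses(f0_hz, duration_s, fs):
    """Source: idealized excitation as a unit impulse train at pitch f0."""
    x = np.zeros(int(round(duration_s * fs)))
    x[::int(round(fs / f0_hz))] = 1.0
    return x

def resonator(x, center_hz, bandwidth_hz, fs):
    """Filter stage: two-pole resonance y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth_hz / fs)               # pole radius from bandwidth
    a1 = 2.0 * r * np.cos(2.0 * np.pi * center_hz / fs)  # pole angle from center frequency
    a2 = -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

def synthesize_vowel(f0_hz, formants, duration_s=0.5, fs=16_000):
    """Pass the source through one resonator per (center, bandwidth) pair."""
    signal = glottal_pulses(f0_hz, duration_s, fs)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, fs)
    return signal / np.max(np.abs(signal))               # normalize to +/- 1

fs = 16_000
vowel = synthesize_vowel(100.0, [(700.0, 80.0), (1200.0, 90.0)], fs=fs)
spectrum = np.abs(np.fft.rfft(vowel))
peak_hz = np.fft.rfftfreq(len(vowel), d=1.0 / fs)[np.argmax(spectrum)]
```

The spectrum of `vowel` contains harmonics of the source at multiples of 100 Hz, weighted by the resonances of the filter, so its strongest component lies near one of the assumed resonance frequencies.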
2) Tube Model
● Vocal tract modelled with tube elements of different diameters
[Figure: approximation of the changing cross-section with piecewise homogeneous tubes; tube model from glottis to lips]
Simplified tube model
● Assumptions:
– The whole vocal tract is a homogeneous tube
– Diameter is much less than length
– Equal diameter over length
– Glottis = total reflector
– Lips = open end
● Result:
– resonant wave
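Under these assumptions the tube is closed at the glottis and open at the lips, so it acts as a quarter-wavelength resonator with resonances at F_n = (2n - 1) · c / (4 · L). The sketch below evaluates this formula; the tube length of 17 cm and the speed of sound of 343 m/s are assumed typical values that do not appear on the slide.

```python
def tube_resonances(length_m=0.17, speed_of_sound=343.0, count=3):
    """Resonance frequencies in Hz of a uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]

resonances = tube_resonances()
```

For the assumed values this gives resonances near 500 Hz, 1500 Hz and 2500 Hz, i.e. odd multiples of the lowest resonance.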
3) Formant shifting model
● Formants are defined by local energy maxima in the spectrum
● The center frequency of a maximum is defined as the formant frequency
● Independent of the fundamental frequency
● Based on the resonance characteristics (size and shape) of the vocal tract
● The 1st and 2nd formants define the vowels
Formant-Shifting (Perturbation Model)
● Raising (+) or lowering (-) the first three formants by shifting the local constriction of the vocal tract
Sonagrams i, u, a
Speech Recognition
Interdisciplinarity of Speech Technology
[Figure: Engineering / Computer Science, Computational Linguistics and Phonetics together enable systems for, e.g., natural dialog, speech understanding and text-to-speech in consumer electronics]
Typical Tasks in Speech Recognition
The speech signal is the input to three tasks:
● Speech recognition → words, e.g. “How are you?”
● Language recognition → language name, e.g. English
● Speaker recognition → speaker name, e.g. Glenn Miller
Goal: Automatically extract the information transmitted in the speech signal
Three Steps of Speech Processing
Spoken input passes through three steps, each reducing uncertainty:
1) Speech Recognition: What did the speaker say?
– Knowledge used: acoustic speech analysis, word lists
– Result: 100 alternatives
2) Speech Analysis: What does the speaker mean?
– Knowledge used: grammar, word definitions
– Result: 10 alternatives
3) Speech Understanding: What is the intent of the speaker?
– Knowledge used: knowledge about topic, dialog partner and context
– Result: unambiguous understanding within the dialog
acc. to W. Wahlster, DFKI
Speech Recognition: Dependencies
● Environment: noise, acoustics, S/N ratio
● Speaker's state: health, stress, gender ..
● Speaker's literacy: language, vocabulary size
● Software: system, dynamics, algorithm, error handling
● Use case: translation, user-device dialog, robotics
● Hardware: microphones, speakers
● Dialog architecture: software design
● Training
Noise contamination of speech
Noise
● Environmental
– Continuous, e.g. air conditioner, motors, fans, continuous conversation
– Transient, e.g. phone, vocal/conversational, alarms
● Personal
– Related to breathing, e.g. body motion, respiratory infections, distorted respiration
– Not related to breathing, e.g. indoor/outdoor movement, clothes, joint crackles
Process Chain in Speech Processing
● Speech Recognition: acoustic wave → possible phonemes → possible words → possible sentences
● Speech Analysis: possible sentences → grammar structure → word meaning → phrase/sentence meaning
● Speech Understanding and Translation: sentence meaning → discourse meaning in source language → phrase choice in target language
● Generation and Synthesis: discourse meaning in target language → phrase choice in target language → sentence production → speech synthesis, prosody generation
Remember: Technical Evaluation of a Speech Signal
● Speech is a continuous evolution of the vocal tract
– Need to extract a time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift
[Figure: each windowed frame → Fourier transform → magnitude]
• Produces the time-frequency evolution of the spectrum
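The sliding-window analysis above can be sketched with NumPy as follows. The 20 ms window and 10 ms shift come from the slide; the Hamming window, the function name and the 1 kHz test tone are assumptions added for illustration.

```python
import numpy as np

def magnitude_spectrogram(signal, fs, window_ms=20.0, shift_ms=10.0):
    """Time series of magnitude spectra, one row per analysis frame."""
    win_len = int(round(fs * window_ms / 1000.0))
    hop = int(round(fs * shift_ms / 1000.0))
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hamming(win_len)                   # assumed window shape
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))     # Fourier transform, then magnitude

fs = 16_000
t = np.arange(fs) / fs                             # one second of signal
tone = np.sin(2.0 * np.pi * 1_000.0 * t)           # 1 kHz test tone
spec = magnitude_spectrogram(tone, fs)
peak_bins = spec.argmax(axis=1)                    # strongest frequency bin per frame
```

At 16 kHz a 20 ms window holds 320 samples, giving 161 frequency bins spaced 50 Hz apart, so the 1 kHz tone peaks in bin 20 of every frame.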
Sonagram
[Figure: narrow-band and broad-band sonagrams of the same utterance; axes: time (s) and frequency; voiced segments and formants are marked]
Segmentability of Sonagrams: Phonemes
Speech Recognition: Problems
acc. to W. Wahlster, DFKI
● “Calligraphy”
● Spontaneous speech
● Nonlinear time distortion
● Channel distortion
● “Cocktail party effect”
● Co-articulation
● Variation in speech (slang)
● No break between words
● Punctuation? Capitalization?
Example utterance in successively degraded forms:
– A very good morning Mrs. Lennard. How is the state of your actual workplan?
– Hi Jane, what's up with your plans?
– Hi Jane what's up with your plans
– HiJanewhatsupwithyourplans
– Uh Jaine, whatss up with ya plan
Speech Recognition: Variety of Signals
“Ich habe einen Termin um 17 Uhr 30” (“I have an appointment at 5:30 pm”)
Speech Recognition: Word Hypothesis Graph
“It's hard to recognize speech”
U Washington, CS
Application to Consumer Electronics Dialog Systems
[Figure: dialog systems arranged by system complexity (low to very high) and size of vocabulary (small to very large): Standard IVR Systems, Command & Control, Telephone Dialogs, Dictation, “Star Trek Dialogs”]
Characteristics of speech processing systems
Training efforts
● Speaker-dependent:
– high training effort
– limited group of users
– highly individual and sensitive to small changes
● Speaker-independent:
– no training, robust
– small vocabulary
● Speaker-adaptive:
– learning system
– instant improvement of recognition
Input types
● Single-word recognition:
– recognition of isolated spoken words
● Discrete recognition:
– short breaks between words
● Continuous recognition:
– no break between words
● Spontaneous recognition:
– speech with or without delays
– interrupted words
Questions?
Speaker Recognition
Features for Speaker Recognition
• Speech is a continuous evolution of the vocal tract
– Need to extract a time series of spectra
– Use a sliding window: 20 ms window, 10 ms shift
[Figure: each windowed frame → Fourier transform → magnitude]
• Produces the time-frequency evolution of the spectrum
General Theory: Speaker Models
● Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form
[Figure: HMM state sequence for the sounds h-a-d]
● Modern speaker verification systems use Hidden Markov Models (HMMs)
– HMMs are statistical models of how a speaker produces sounds
– HMMs represent the underlying statistical variations within a speech state (e.g., a phoneme) and the temporal changes of speech between the states
– Fast training algorithms (EM) exist for HMMs, with guaranteed convergence to a local optimum of the likelihood
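How an HMM scores an utterance can be sketched with the forward algorithm, which sums over all state sequences to give the likelihood of the observations under one speaker model. This is a toy example under stated assumptions: observations are discrete symbols (real systems use continuous spectral feature vectors), and the two-state model parameters are invented for illustration.

```python
import numpy as np

def forward_log_likelihood(obs, start_prob, trans_prob, emit_prob):
    """Log-likelihood of a discrete observation sequence under an HMM.

    start_prob[i]    : probability of starting in state i
    trans_prob[i, j] : probability of moving from state i to state j
    emit_prob[i, k]  : probability of emitting symbol k in state i
    """
    alpha = start_prob * emit_prob[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale                  # rescale to avoid numerical underflow
    for symbol in obs[1:]:
        alpha = (alpha @ trans_prob) * emit_prob[:, symbol]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale
    return log_lik

# Hypothetical two-state speaker model with two observation symbols.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
log_lik = forward_log_likelihood([0, 1], start, trans, emit)
```

For verification, a score like `log_lik` from the claimed speaker's model would typically be compared against the score from an alternative model and a threshold; that decision step is not shown here.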
Neural network-based speech recognition
Another approach to acoustic modeling is the use of neural networks. They can solve much more complicated recognition tasks, but do not scale as well as HMMs to large vocabularies. Rather than being used in general-purpose speech recognition applications, they are applied where low-quality, noisy data and speaker independence must be handled. Such systems can achieve greater accuracy than HMM-based systems, as long as there is sufficient training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, and the results are generally better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.
Following: Part II
Applications
Psychoacoustics
University of Surrey, UK
Voiceprint as a Biometric
• Biometric: a human generated signal or attribute for authenticating a person’s identity
• Voice is a popular biometric:
– natural signal to produce
– ubiquitous: telephones, microphone-equipped PCs
• Voice biometric combined with other forms of security
– Something we have, e.g., badge
– Something we are, e.g., voice
– Something we know, e.g., password
[Figure: Have, Know, Are; combining all three gives the strongest security]