Acoustics of Speech

Course Overview

Lecture 4
Spoken Language Processing
Prof. Andrew Rosenberg

Overview
What is in a speech signal?

Defining cues to phonetic segments and intonation.
Techniques to extract these cues.

Phone Recognition
Goal: distinguishing one phoneme from another automatically.
ASR: Did the caller say "I want to fly to Newark" or "I want to fly to New York"?
Forensic linguistics: Did that person say "Kill him" or "Bill him"?
What evidence is available in the speech signal?
How accurately and reliably can we extract it?
What qualities make this difficult? Easy?

Prosody and Intonation
How things are said is sometimes critical and often useful for understanding.
Forensic linguistics: "Kill him!" vs. "Kill him?"
TTS: "Travelling from Boston?" vs. "Travelling from Boston."
What information do we need to extract from, or generate in, the speech signal?
What tools do we have to do this?

Speech Features
What cues are important?
  Spectral features
  Fundamental frequency (pitch)
  Amplitude/energy (loudness)
  Timing (pauses, rate)
  Voice quality
How do we extract these?
  Digital signal processing
  Tools and algorithms
    Praat
    Wavesurfer
    Xwaves

Sound Production
Pressure fluctuations in the air caused by a voice, a musical instrument, a car horn, etc.
Sound waves propagate through a medium: air, but also solids, etc.
They cause the eardrum (tympanum) to vibrate.
The auditory system translates this into neural impulses.
The brain interprets these as sound.
We represent sounds as change in pressure over time.

How loud are sounds?
Event              Pressure (µPa)   dB
Absolute silence   20               0
Whisper            200              20
Quiet office       2K               40
Conversation       20K              60
Bus                200K             80
Subway             2M               100
Thunder            20M              120
*Hearing damage*   200M             140

Voiced Sounds are (mostly) Periodic
Simple periodic waves (sine waves) are defined by:
Frequency: how often the pattern repeats per time unit
  Cycle: one repetition
  Period: duration of a cycle
  Frequency: number of cycles per time unit (usually per second)
  Frequency in Hertz (Hz): cycles per second, or 1 / period
  E.g. 400 Hz = 1/0.0025 (a cycle has a period of 0.0025 seconds; 400 cycles complete in a second)
  Zero crossing: where the waveform crosses the x-axis
Amplitude: peak deviation of pressure from normal atmospheric pressure
Phase: timing of a waveform relative to a reference point

Phase Differences
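The three parameters above (frequency, amplitude, phase) can be sketched in a few lines of Python. This is only an illustration: the function names, the 16 kHz sample rate and the 0.1 rad phase offset are assumptions chosen for the example, not values from the lecture.

```python
import numpy as np

def sine_wave(freq_hz, amplitude=1.0, phase_rad=0.0, duration_s=1.0, sample_rate=16000):
    """Return (times, samples) for amplitude * sin(2*pi*f*t + phase)."""
    t = np.arange(int(round(duration_s * sample_rate))) / sample_rate
    return t, amplitude * np.sin(2 * np.pi * freq_hz * t + phase_rad)

def count_zero_crossings(x):
    """Count sign changes between consecutive nonzero samples."""
    signs = np.sign(x)
    signs = signs[signs != 0]
    return int(np.count_nonzero(signs[:-1] != signs[1:]))

freq_hz = 400.0
period_s = 1.0 / freq_hz  # 0.0025 s per cycle, as in the 400 Hz example

# Assumed illustration values: amplitude 0.5, phase offset 0.1 rad, one second of signal.
t, x = sine_wave(freq_hz, amplitude=0.5, phase_rad=0.1)

# A sine wave crosses zero twice per cycle, so crossings / 2 approximates
# the number of cycles in the one-second signal.
crossings = count_zero_crossings(x)
print(period_s, crossings / 2.0)
```

Counting zero crossings gives only a rough frequency estimate, and it works here because the signal is a single clean sine wave.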

Complex Periodic Waves
Cyclic, but composed of multiple sine waves.
A complex wave has a fundamental frequency (F0), the rate at which the largest pattern repeats, and its harmonics.
  F0 is also the greatest common divisor (GCD) of the component frequencies.
Harmonics: the rates of the shorter patterns.
Any complex waveform can be analyzed into its component sine waves with their frequencies, amplitudes and phases (Fourier's theorem, covered in two lectures).

2 sine waves -> 1 complex wave
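The GCD relationship can be checked with a small sketch. The component frequencies (200, 300 and 500 Hz), their amplitudes and the helper names are assumptions made up for this example.

```python
import math
from functools import reduce

import numpy as np

def fundamental_hz(component_freqs_hz):
    """F0 as the greatest common divisor of integer component frequencies."""
    return reduce(math.gcd, component_freqs_hz)

def complex_wave(components, duration_s, sample_rate):
    """Sum sine waves given as (freq_hz, amplitude, phase_rad) tuples."""
    t = np.arange(int(round(duration_s * sample_rate))) / sample_rate
    return sum(a * np.sin(2 * np.pi * f * t + p) for f, a, p in components)

sample_rate = 16000
components = [(200, 1.0, 0.0), (300, 0.5, 0.0), (500, 0.25, 0.0)]

f0 = fundamental_hz([f for f, _, _ in components])
x = complex_wave(components, duration_s=0.05, sample_rate=sample_rate)

# The summed wave should repeat once every 1/F0 seconds.
samples_per_period = sample_rate // f0
repeats = np.allclose(x[:samples_per_period],
                      x[samples_per_period:2 * samples_per_period])
print(f0, samples_per_period, repeats)
```

Note that none of the components is at F0 itself in this example: the overall pattern still repeats at the GCD of the component frequencies.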

4 sine waves -> 1 complex wave

Power Spectra and Spectrograms
The frequency components of a complex waveform are represented in the power spectrum.
  It plots the frequency and amplitude of each component sine wave.
Adding a temporal dimension -> spectrogram.
Obtained via the Fast Fourier Transform (FFT) or Linear Predictive Coding (LPC).
Useful for analysis, coding and synthesis.

Example Power Spectrum
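A minimal sketch of both representations using NumPy's FFT routines follows. It covers only the FFT route, not LPC. The Hann window, frame length, hop size and test signal are assumptions for the example and are not the settings behind the lecture's figures.

```python
import numpy as np

def power_spectrum(x, sample_rate):
    """Power at each frequency for the whole signal (Hann-windowed FFT)."""
    spectrum = np.fft.rfft(x * np.hanning(len(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return freqs, np.abs(spectrum) ** 2

def spectrogram(x, sample_rate, frame_len=256, hop=128):
    """Power spectra of short overlapping frames: rows are time, columns are frequency."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return times, freqs, power

# Assumed test signal: a 200 Hz sine plus a weaker 1000 Hz sine.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

freqs, power = power_spectrum(x, sample_rate)
strongest_hz = freqs[np.argmax(power)]

times, frame_freqs, frame_power = spectrogram(x, sample_rate)
print(strongest_hz, frame_power.shape)
```

Shorter frames give better time resolution but coarser frequency resolution, which is why the analysis window length is usually reported alongside a spectrum or spectrogram.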

http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
Australian male /i:/ from "heed"; FFT analysis window 12.8 ms

Example Spectrogram

Example Spectrogram from Praat

Terms
Spectral slice: plots the amplitude at each frequency.
Spectrogram: plots amplitude and frequency over time.
Harmonics: components of a complex waveform that are multiples of the fundamental frequency (F0).
Formants: frequency bands that are most amplified in speech.

Aperiodic Waveforms
Waveforms with random or non-repeating patterns.
Random aperiodic waveforms: white noise
  Flat spectrum: equal amplitude for all frequency components.
Transients: sudden bursts of pressure (clicks, pops, lip smacks, door slams, etc.)
  Flat spectrum at a single impulse.
Voiceless consonants

Speech Waveforms
Airflow from the lungs plus vocal fold vibration is filtered by the resonances of the vocal tract to produce complex, periodic waveforms.
Pitch range, mean, max: cycles per second of the lowest-frequency periodic component of a signal = fundamental frequency (F0)
Loudness
  RMS amplitude
  Intensity: in dB, where P0 is a reference pressure
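The two loudness measures can be sketched as follows. This assumes one common definition of sound pressure level, 20 * log10(p / P0) with P0 = 20 µPa (the usual threshold-of-hearing reference), which may differ in form from the formula shown on the slide. The function names are made up for the example.

```python
import numpy as np

# Assumed reference pressure: 20 micropascals, the usual 0 dB SPL reference.
P0_MICROPASCAL = 20.0

def rms_amplitude(samples):
    """Root-mean-square amplitude: sqrt of the mean squared sample value."""
    samples = np.asarray(samples, dtype=float)
    return float(np.sqrt(np.mean(samples ** 2)))

def intensity_db(pressure_rms, p0=P0_MICROPASCAL):
    """Level in dB relative to p0 (pressure_rms and p0 must share the same unit)."""
    return float(20.0 * np.log10(pressure_rms / p0))

# Pressures in micropascals: 20 (silence), 20K (conversation), 200M (hearing damage).
for pressure in (20.0, 20_000.0, 200_000_000.0):
    print(pressure, intensity_db(pressure))

# RMS of one full cycle of a unit-amplitude sine wave is 1/sqrt(2).
one_cycle = np.sin(2 * np.pi * np.arange(1000) / 1000)
print(rms_amplitude(one_cycle))
```

Under this definition every tenfold increase in pressure adds 20 dB, which is the pattern in the "How loud are sounds?" table.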

Collecting speech for analysis?
Recording conditions
  A quiet office, a sound booth, an anechoic chamber
Microphones convert sound into electrical current: oscillations of air pressure are converted to oscillations of current.
Analog devices (e.g. tape recorders) store these as a continuous signal.
Digital devices (e.g. DAT, computers) convert to a digital signal (digitizing).

Digital Sound Representation
A microphone is a mechanical eardrum, capable of measuring change in air pressure over time.
Digital recording converts analog (smoothly continuous) changes in air pressure over time to a digital signal.
The digital representation:
  measures the pressure at a fixed time interval (sampling rate)
  represents pressure as an integer value (bit depth)
The analog-to-digital conversion results in a loss of information.

Waveform Name


Analog to Digital Conversion
Quantization or discretization (with an example waveform, drawn by hand).

Analog to Digital Conversion
Bit depth impact
  16-bit sound (CD quality)
  8-bit sound
Sampling rate impact
  44.1 kHz
  16 kHz
  8 kHz
  4 kHz
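The effect of bit depth can be sketched by rounding a sampled sine wave to integers and measuring how much error the rounding introduces. The helper names, the 440 Hz test tone and the [-1, 1] input range are assumptions for this illustration.

```python
import numpy as np

def quantize(x, bits):
    """Round samples in [-1, 1] to signed integers of the given bit depth."""
    max_int = 2 ** (bits - 1) - 1
    return np.round(np.clip(x, -1.0, 1.0) * max_int).astype(np.int64)

def dequantize(q, bits):
    """Map integer samples back to the [-1, 1] range."""
    return q / (2 ** (bits - 1) - 1)

def snr_db(x, bits):
    """Signal-to-quantization-noise ratio in dB for a given bit depth."""
    error = x - dequantize(quantize(x, bits), bits)
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum(error ** 2)))

# Assumed test signal: one second of a full-scale 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440 * t)

snr_8 = snr_db(x, 8)
snr_16 = snr_db(x, 16)
print(snr_8, snr_16)
```

Each extra bit halves the rounding step, so the 16-bit version should come out several tens of dB cleaner than the 8-bit one. That difference is the loss of information attributed to bit depth.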

Examples

Nyquist Rate
At least 2 samples per cycle are necessary to capture the periodicity of a waveform at a given frequency.
  100 Hz needs 200 samples per second.
Nyquist frequency (often loosely called the Nyquist rate)
  The highest frequency that can be captured with a given sampling rate: half the sampling rate.
  An 8 kHz sampling rate (telephone speech) can capture frequencies up to 4 kHz.

Sampling/Storage Trade-off
Human hearing: ~20 kHz top frequency.
  Should we sample at 40 kHz?
Telephone speech: 300 Hz-4 kHz (8 kHz sampling)
  But some speech sounds (e.g., fricatives, stops) have energy above 4 kHz.
  Peter, Teeter, Dieter
44 kHz (CD quality) vs. 16-22 kHz
  16-22 kHz is usually good enough to study speech: amplitude, duration, pitch, etc.
  Golden Ears.

Filtering
Acoustic filters block out certain frequencies of sounds.
  A low-pass filter blocks high-frequency components.
  A high-pass filter blocks low frequencies.
  A band-pass filter blocks both high and low frequencies, passing only a band in between.
  Reject band (what to block) vs. pass band (what to let through)
What if the frequencies of two sounds overlap?
  Source separation
(Include examples for filtering.)

Estimating Pitch
Pitch tracking: estimate F0 over time as a function of vocal fold vibration.
How? The autocorrelation approach:
  A periodic waveform is correlated with itself, since one period looks like another.
  Find the period by finding the lag (offset) between two windows of the signal where the correlation of the windows is highest.
  The lag duration, T, is one period of the waveform.
  F0 is the inverse: 1/T.

Pitch Issues
Microprosody: effects of consonants (e.g. /v/)
Creaky voice -> no pitch track, or a noisy estimate
Errors to watch for:
  Halving: the shortest lag calculated is too long, by one or more cycles.
    Since the estimated lag is too long, the pitch estimate is too low (underestimation).
  Doubling: the shortest lag is too short, because the second half of the cycle is similar to the first.
    The tracker estimates a short lag and counts too many cycles per second (overestimation).

Pitch Doubling and Halving
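The autocorrelation approach can be sketched for a single analysis frame as below. This is a bare-bones illustration, not how Praat or any other tool implements pitch tracking: the function name, the 75-500 Hz search range and the synthetic test frame are all assumptions.

```python
import numpy as np

def estimate_f0_autocorr(frame, sample_rate, f0_min=75.0, f0_max=500.0):
    """Estimate F0 of one frame as sample_rate / (lag with the highest autocorrelation)."""
    frame = frame - np.mean(frame)
    # Correlate the frame with itself and keep the non-negative lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Only search lags that correspond to plausible pitch values.
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(ac) - 1)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / best_lag  # F0 = 1 / T, with T = best_lag / sample_rate

# Assumed test frame: 40 ms of a 200 Hz wave with two weaker harmonics.
sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate
frame = (np.sin(2 * np.pi * 200 * t)
         + 0.5 * np.sin(2 * np.pi * 400 * t)
         + 0.3 * np.sin(2 * np.pi * 600 * t))

f0 = estimate_f0_autocorr(frame, sample_rate)

# Forcing the search to skip the true period illustrates a halving error:
# the best remaining lag spans two cycles, so the estimate drops to about F0 / 2.
f0_halved = estimate_f0_autocorr(frame, sample_rate, f0_min=75.0, f0_max=150.0)
print(f0, f0_halved)
```

A full pitch tracker would run this over successive frames and add voicing decisions and smoothing across frames. The search range matters because a range that excludes the true period produces exactly the halving or doubling errors listed above.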

Halving Error
Doubling Error

Next Class
Speech Recognition Overview
Reading: J&M 9.1, 9.2, 5.5
