Audio analysis 1SGN-24006
Introduction to Audio Content Analysis*
* Slides for this lecture were created by Anssi Klapuri
Contents:
  challenges, applications
  speech recognition
  music transcription
  auditory scene analysis
  sources of information: acoustic signal, internal models
  mid-level data representations
  decomposing polyphonic signals along various dimensions
Understanding audio
Human listeners are very skilled at making sense of complex audio signals
  On a busy city street: noticing cars passing, footsteps approaching, people talking nearby, etc.
  In music: the ability to focus on a certain instrument in polyphonic music
  Speech recognition: recognizing speech despite speaker-dependent variation, environmental noise, etc.
Computational analysis of audio
Audio content analysis is hard for many reasons:
  Polyphony: natural sounds do not occur in isolation; often several active sound sources combine into a complex mix
  Real-world signals (speech, music, environmental audio) are "dirty": analysis is a less controlled situation than synthesis (think of 3D graphics vs. machine vision, or speech synthesis vs. speech recognition)
  Perceptual attributes of sounds (pitch, timbre, loudness, etc.) have a non-trivial connection to acoustic properties
  Cognition is ultimately AI-complete: humans interpret sounds using the context and a high-level model of the world (artificial intelligence)
Audio analysis is of current interest
Computational power allows more complex problems and larger amounts of data to be processed
Amount of digital information increases rapidly
  the information is futile without efficient management tools
  it should be processed by content, not just as a pile of bits
Computers lack perception
  communication between humans and a computer is unnatural and inflexible (figure shows how deaf-blind people communicate)
  let's bring computers to the real world, not humans to the computers' world
Applications
Speech recognition: Siri, Google voice search
Music search: Shazam, SoundHound
Music transcription: GuitarBots game, Wavetick lighting control
Audio classification: Movie soundtrack segmentation
Auditory scene analysis: Mobile devices that adapt to situational context, hearing aids (still in the making)
Automatic speech recognition
Ultimate goal: to accurately convert an acoustic signal into a word sequence, independent of speaker and environment
To improve accuracy, assumptions are often made
  Speaker-dependent (or adapted) vs. speaker-independent
  Clean-speech vs. environmentally-robust recognition
  Isolated-word vs. continuous-speech recognition
Applications
  Speech interfaces
  Spoken document retrieval
  Speech-to-speech translation
He knew what taboos he was violating
Music transcription
Excerpt from Song 34 in RWC popular music database
Figures top-down:
  1. time-domain signal
  2. spectrogram
  3. musical notation
  4. piano roll
Applications include
  music retrieval
  intelligent processing
  music tutors, games
  auto-accompaniment for a soloist, etc.
Music transcription: subtopics
Beat tracking and meter analysis (beat/tactus = foot-tapping rate)
Multi-pitch estimation (potentially several concurrent pitches)
Drum and percussion transcription (+ instrument recognition)
Auditory scene analysis
Analysis of sounds from our living environment
Recognition of the context (home, street, restaurant, shop, train, office, ...) and detection of individual sound sources and events
Applications
  context-aware mobile devices
  hearing aids
  movie soundtrack segmentation and analysis (football goal detection etc.)
Auditory scene analysis vs. audio classification
Auditory scene analysis (of a car crash): very hard in the general case
Audio segmentation and classification (movie soundtrack): more straightforward
Sources of information: Acoustic signal and internal models
Internal models are crucial for robust analysis
  Speech recognition systems depend on language models (e.g. probabilities of different word sequences)
  Musicological models are equally important for music transcription
  Auditory scene analysis employs event probabilities and sequential dependencies in different contexts
Internal models can be learned from training material, and further adapted
[Figure: block diagram with the labels Acoustic signal, Internal models, Analysis, Result]
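As an illustration of an internal model, the "probabilities of different word sequences" idea can be sketched as a bigram language model. This is a minimal sketch under stated assumptions, not the model used by any particular recognizer: the toy corpus, the function names train_bigram and log_prob, and the add-one smoothing are all chosen here for illustration.

```python
from collections import Counter, defaultdict
import math

def train_bigram(sentences):
    """Count word pairs over tokenized sentences, with start/end markers."""
    counts = defaultdict(Counter)
    vocab = set()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts, vocab

def log_prob(words, counts, vocab):
    """Log-probability of a word sequence, using add-one smoothing (an assumption)."""
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = counts[prev][cur] + 1
        den = sum(counts[prev].values()) + len(vocab)
        total += math.log(num / den)
    return total

# Toy training corpus, made up for illustration only.
corpus = [
    "recognize speech".split(),
    "recognize speech in noise".split(),
    "wreck a nice beach".split(),
    "recognize speech".split(),
]
counts, vocab = train_bigram(corpus)
lp_likely = log_prob("recognize speech".split(), counts, vocab)
lp_unlikely = log_prob("speech recognize".split(), counts, vocab)
print(lp_likely > lp_unlikely)
```

A recognizer could combine such a score with the acoustic evidence, so that word sequences seen in training are preferred over acoustically similar but unlikely ones.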
Mid-level data representations
Signal analysis can be viewed as a sequence of representations from low (audio signal) to high (recognition result)
Intermediate (mid-level) representations are necessary since the high-level information is usually not directly visible in the input audio signal
Figure below shows some example mid-level representations: 1) spectrogram, 2) sinusoid tracks, 3) critical-band energies
More about these in the coming lectures
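Two of the mid-level representations named above can be sketched with NumPy. This is a rough sketch under assumptions: the frame length, hop size, Hann window, and band edges are arbitrary choices made here, and the bands are coarse stand-ins rather than true critical bands.

```python
import numpy as np

def spectrogram(x, frame_len=1024, hop=512):
    """Magnitude spectrogram: FFT magnitudes of Hann-windowed frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (bins, frames)

def band_energies(spec, sr, edges_hz):
    """Sum spectral power inside each band [edges[i], edges[i+1])."""
    freqs = np.linspace(0, sr / 2, spec.shape[0])
    power = spec ** 2
    return np.stack([power[(freqs >= lo) & (freqs < hi)].sum(axis=0)
                     for lo, hi in zip(edges_hz[:-1], edges_hz[1:])])

sr = 16000
frame_len = 1024
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)          # 1 s test tone at 440 Hz

spec = spectrogram(x, frame_len=frame_len)
peak_hz = np.argmax(spec.mean(axis=1)) * sr / frame_len
bands = band_energies(spec, sr, [0, 200, 400, 800, 1600, 8000])
print(spec.shape, peak_hz, bands.sum(axis=1).argmax())
```

The spectrogram keeps full frequency resolution, while the band energies compress each frame to a handful of numbers, which is often a more convenient input for later classification stages.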
Breaking up polyphonic audio signals
There are various dimensions along which an audio signal can be decomposed
  Time (temporal segmenting)
  Frequency (filtering)
  Space (angle of arrival)
  Sinusoids vs. noise
  Sound source separation (various approaches)
The various dimensions allow extracting layers of sound to some extent
Fundamentals
Intermediate difficulty, but straightforward
Ultimate goal, very difficult
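The simplest of these dimensions, time, can be illustrated with an energy-based segmenter. This is only a sketch of one possible approach: the function name active_segments, the 50 ms frame length, and the -40 dB threshold are assumptions made here, and the test signal is synthetic.

```python
import numpy as np

def active_segments(x, sr, frame_len=400, threshold_db=-40.0):
    """Return (start_s, end_s) pairs where the frame RMS level exceeds the threshold."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20 * np.log10(rms + 1e-12)   # small offset avoids log(0)
    active = level_db > threshold_db

    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                        # segment begins
        elif not is_active and start is not None:
            segments.append((start * frame_len / sr, i * frame_len / sr))
            start = None                     # segment ends
    if start is not None:                    # signal ends while still active
        segments.append((start * frame_len / sr, n_frames * frame_len / sr))
    return segments

sr = 8000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)    # 0.5 s tone
silence = np.zeros(sr // 2)                 # 0.5 s silence
x = np.concatenate([silence, tone, silence, tone])
segments = active_segments(x, sr)
print(segments)
```

On real recordings a fixed threshold is fragile, since background noise rarely drops to digital silence, which is one reason the other dimensions in the list are needed as well.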
Spatial information (angle of arrival)
Important for human auditory scene analysis (natural environments)
Usability of spatial information for music analysis depends on genre
[Figure: two examples, each showing left and right channels]
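One simple way to use spatial information in a stereo music recording is a mid/side split. The sketch below assumes a studio-style mix in which sources are placed by amplitude panning only; the two sources, their names, and their pan positions are hypothetical.

```python
import numpy as np

def mid_side(left, right):
    """Split a stereo pair into mid (centre) and side (off-centre) signals."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

sr = 8000
t = np.arange(sr) / sr
vocal = np.sin(2 * np.pi * 220 * t)    # hypothetical centre-panned source
guitar = np.sin(2 * np.pi * 330 * t)   # hypothetical source panned hard left

left = vocal + guitar                  # the left channel carries both sources
right = vocal.copy()                   # the right channel carries only the centre source
mid, side = mid_side(left, right)

# The centre-panned source cancels in the side signal.
print(np.allclose(side, 0.5 * guitar))
```

In a live recording made with distant microphones the sources differ in delay and reverberation as well as level, so this cancellation no longer holds, which is consistent with the remark that usability depends on genre.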
Tonal vs. noise-like spectral components
Perceptually and musically, it is meaningful to make a distinction between tonal and noisy spectral elements
  noise masks a tone more efficiently than vice versa, for example
In a music spectrogram, horizontal (tonal) and vertical (noisy, percussive) structures are often evident
Brentwood jazz quartet
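The horizontal/vertical observation suggests one way to separate the two kinds of components: median filtering the magnitude spectrogram along time and along frequency. This is a known approach to harmonic/percussive separation and is shown here as an illustration, not as the method behind the lecture's figure. The function names, the filter length of 9, and the toy spectrogram are assumptions.

```python
import numpy as np

def median_filter_1d(a, size, axis):
    """Median over a sliding window of odd length `size` along `axis` (edge-padded)."""
    half = size // 2
    pad = [(0, 0)] * a.ndim
    pad[axis] = (half, half)
    padded = np.pad(a, pad, mode="edge")
    n = a.shape[axis]
    shifted = [np.take(padded, np.arange(k, k + n), axis=axis)
               for k in range(size)]
    return np.median(np.stack(shifted), axis=0)

def tonal_noise_masks(spec, size=9):
    """Soft masks for horizontal (tonal) and vertical (percussive) structure."""
    tonal = median_filter_1d(spec, size, axis=1)       # smooth along time
    percussive = median_filter_1d(spec, size, axis=0)  # smooth along frequency
    total = tonal + percussive + 1e-12                 # avoid division by zero
    return tonal / total, percussive / total

# Toy magnitude spectrogram (bins x frames): one sustained tone, one click.
spec = np.zeros((40, 40))
spec[10, :] = 1.0   # horizontal line: one frequency active in all frames
spec[:, 25] = 1.0   # vertical line: all frequencies active in one frame
tonal_mask, percussive_mask = tonal_noise_masks(spec)
print(tonal_mask[10, 5], percussive_mask[30, 25])
```

Multiplying a complex spectrogram by each mask and inverting the transform would give two audio layers. Where the lines cross, both masks are about 0.5, which shows that the split is only approximate.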
More demos
http://arg.cs.tut.fi/demos