Music Information Retrieval. Social and Commercial Implication Enormous amount of recorded music throughout the world Over 10,000 new CDs are released

Music Information Retrieval

Social and Commercial Implication

• Enormous amount of recorded music throughout the world

• Over 10,000 new CDs are released every year

• The search for music, specifically in the MP3 format, is the most popular retrieval request

• In the U.S. alone, 1.08 billion units of recorded music (e.g., CDs, cassettes, music videos), valued at $14.3 billion, were shipped to retailers in the year 2000

Traditional Music IR mechanism

Text-based examples
• Peer-to-peer file sharing software
• FTP
• Streaming audio
• Websites
• Online network drives
• Clip Art

[Figure labels: mp3, text]

Content-based MIR

Content-Based MIR systems

• Based on high level music content: timbre, melody, rhythm, genre, mood

• Content-based recognition techniques: Feature extraction and Pattern recognition

Traditional MIR systems

• Based on symbolic representation: title, lyrics, composer, performer, singer

• Use similar techniques as text retrieval

Music Information Retrieval (MIR)

Disadvantages:

• The information is not enough to represent the content of the music

• There are enormous amounts of music without any textual information

Scenario 1

Timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition, mood recognition

Chopin’s Scherzo – a delightful piano melody

Previous preferences

MIR System

query

Music on Internet

Music in databases

Timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition, mood recognition

Summaries of ten best candidates

A series of other processes

......

until

Place an order for the final selection and get a full piece

Scenario 2

Listen to the audio program “Music of the World”

Jazz bibliography

Cut the jazz pieces from the audio stream

Searching

Bibliography

Music

Jazz Classical Pop R&B Rock

Bach Beethoven Mozart

Maintain a categorical archive

Content-based MIR

Advantages
• Quick and easy searching
• Fewer problems with text labeling

Disadvantages
• Difficulties in sound recognition

Philips Audio Fingerprinting

Audio Fingerprinting
• A microphone device captures your voice
• “Audio fingerprints” are determined
• The melody query is then sent to a database
• Results are returned
• Good for finding a song you don’t know by name but know by tune

Jaap Haitsma and Ton Kalker. “A Highly Robust Audio Fingerprinting System”. Proceedings of ISMIR 2002, Paris, France, October 2002.

Philips Audio Fingerprinting Applications

• Music recognition over mobile phones
• Building broadcast monitoring systems for copyright verification, commercial verification, and royalty metering
• Maintaining personal music archives
• Allows the creation of music-aware networks, allowing reliable control over the flow of copyrighted music
• An essential tool for building more secure digital audio watermarks


Performance
• Efficient: typically 3 seconds of music is enough to identify the music
• Fast
• Robust: is not affected by compression or environmental noise
• Highly accurate: can discriminate between different versions of a song, even if performed by the same artist


Main Challenges
• Unique features: these features are loosely analogous to normal fingerprints for human beings
• Clever searching: algorithms to compare these audio fingerprints to large databases of previously extracted audio fingerprints
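As a rough illustration of the searching step, the sketch below assumes that each fingerprint is a fixed-length bit string stored as a Python integer and that the best match is the database entry with the smallest Hamming distance. The 32-bit width, the `max_distance` threshold, the song titles, and the function names are illustrative assumptions, not the Philips implementation; a real system would use an index rather than a linear scan.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def best_match(query, database, max_distance=8):
    """Return (title, distance) of the closest fingerprint,
    or None when nothing is within max_distance bits."""
    best = None
    for title, fingerprint in database.items():
        d = hamming(query, fingerprint)
        if best is None or d < best[1]:
            best = (title, d)
    if best is not None and best[1] <= max_distance:
        return best
    return None


# Hypothetical 32-bit fingerprints for two songs.
database = {
    "song A": 0b1011_0010_1110_0001_0101_1100_0011_1010,
    "song B": 0b0100_1101_0001_1110_1010_0011_1100_0101,
}

# Simulate a noisy recording by flipping two bits of song A's fingerprint.
noisy_query = database["song A"] ^ 0b101
print(best_match(noisy_query, database))
```

The threshold lets the search reject queries that match nothing in the database instead of always returning the nearest entry.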

Other Challenges
• Must be able to work with short segments of music
• The music that is offered to the fingerprint extractor is of poor quality
• The easiest way to lose customers is to provide poor performance
• The cell-phone recognition service must make economic sense

http://www.research.philips.com/InformationCenter/Global/FArticleSummary.asp?lNodeId=927

http://www.soundfisher.com

Keislar, D., T. Blum, J. Wheaton, and E. Wold. “A content-aware sound browser”. Proceedings of the International Computer Music Conference, ICMA, 1999.

SoundFisher (Muscle Fish)







SoundFisher

Supports Traditional Text and Numeric Queries
• Performs text searches on keywords, comments, composer or performer names, and user-defined fields.
• Searches and sorts by file attributes, such as sample rate, number of channels, etc.

Powerful "Sounds-like" Queries
• Finds sounds by example using selected sounds or user-defined sound categories.
• Searches and sorts sounds by similarity, based on pitch, loudness, brightness and/or overall timbre.
• However, SoundFisher is not intended as a search mechanism for music catalogs. In other words, it does not address sound at the level of the musical phrase, melody, rhythm or tempo.

Permits Flexible Categories
• Supports sound categories that contain nested categories.
• No need to conform to predefined categories.

Muscle Fish

IRCAM Studio-On-Line: a content-based search & classification interface

The "Sound Palette" provides access to an instrumental sound database, as well as a Web site on Instruments and their playing modes.

IRCAM Studio-On-Line

• Offers a primitive search-by-perceptual-similarity function: search available sounds through high-level criteria (sound categories, dynamic profiles, timbre similarity, etc.)
• Offers manual and automatic sound labeling and classification: learns classification criteria from user-provided sample training sets, and then performs automatic classification of newly entered samples among the learned classes
• Access to audio material in various formats: mp3 streaming, audio files, and compressed archive downloads

Music Recognition

Music Recognition
• Timbre recognition
• Singer recognition
• Melody recognition
• Rhythm recognition
• Genre recognition
• Mood recognition

Challenges:

• Multirepresentational Challenge

• Multicultural Challenge

• Multiexperiential Challenge

Audio Fingerprinting: unique features, clever searching

SoundFisher: "Sounds-like" queries and categories

Studio-On-Line: search-by-perceptual-similarity

Music Recognition

Timbre recognition
• Find solo violin pieces
• Which works use the following combination of instruments?

Singer recognition
• Find songs by Sting
• Who sings the song I just played?
• I want to find some songs for karaoke, where the singer’s voice is similar to my own

Music Recognition

Melody recognition
• Find songs with a similar melody to what I am listening to right now

Rhythm recognition
• Find songs with a dance rhythm

Genre recognition
• Find songs whose style is similar to the one I am listening to now

Mood recognition
• Find a soft song to calm myself after a hard day at UST (Univ. of Stress & Tension)

Music Recognition

Automatic Music Transcription Systems
• Transcribe sound files to high-level musical notation (MIDI files or sheet scores)

Music Annotation and Indexing
• Segment audio streams and assign symbols to indicate the content of the segment.

MPEG-7 descriptors
• The MPEG description consists of semantic descriptors (e.g., type of music), and perceptual features describing the audio content.

Video Indexing and Retrieval
• Audio is an important component in video indexing and retrieval.

Timbre Recognition

The four basic perceptual attributes of sound
• Pitch / Fundamental frequency
• Loudness / Amplitude
• Duration
• Timbre

Definition: Timbre is the quality of a sound by which a listener can judge that two sounds of the same loudness and pitch are dissimilar.

Instrument Recognition
• Recognize instrument families and individual instruments

Timbre

Five classes of instruments

• The strings: instruments with strings that are played by touching them with a bow or plectrum (violin, viola, cello, double bass, guitar)
• The brass: wind instruments made of brass (trumpet, trombone, French horn, tuba)
• The woodwinds: wind instruments often made of wood, flutes and reeds (double reeds, clarinets, saxophones, flutes, piccolo, bassoon)
• The percussion: timpani, marimba, drums, cymbal, gong
• The keyboard: harpsichord, piano, organ

What to Recognize

instrument family individual instrument

AND

Timbre Recognition

Instrument families: strings, brass, woodwinds, percussion, keyboard

A taxonomic hierarchy

Piano

Piano

All Instruments

Released Sustained

Pizzicato strings

Guitar Violin Viola Cello Double Bass

Bowed strings

Violin Viola Cello Double Bass

Flute or Piccolo

Flute, Alto flute, Bass flute, Piccolo

Brass or Reeds

Reeds

Oboe English horn Bassoon Clarinet Saxophone

Brass

Trumpet, French horn, Trombone, Tuba

Recognition Systems

Human Recognition System

sensory transduction

auditory grouping

analysis of features

matching with lexicon

meaning & significance

lexicon of names

recognition

McAdams’ model of human auditory processing

Little is known about the human sound source recognition system

Recognition Systems

Classification

Training

S(n)

Pre-processing

Feature Extractor

Multi-level representation

Classifier

Model Training

Instrument model

Temporal features

Cepstral features

Spectral features

Other features Timbre
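The training and classification stages in the diagram can be sketched with a deliberately simple model. The example below assumes each sound has already been reduced to a numeric feature vector and uses one mean vector per instrument as the "instrument model" with nearest-mean classification. The classifier choice, the two-dimensional feature values, and the function names are illustrative assumptions; the slides do not specify a particular classifier.

```python
import math


def train(labelled_features):
    """Model training: average the feature vectors of each instrument."""
    sums, counts = {}, {}
    for label, vec in labelled_features:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def classify(model, vec):
    """Classification: pick the instrument whose mean vector is closest."""
    return min(model, key=lambda label: math.dist(model[label], vec))


# Hypothetical (made-up) feature vectors for two instruments.
training_set = [
    ("flute", [0.8, 0.2]),
    ("flute", [0.9, 0.3]),
    ("tuba", [0.1, 0.7]),
    ("tuba", [0.2, 0.6]),
]
model = train(training_set)
print(classify(model, [0.7, 0.3]))
```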

Recognition Systems

Monophonic recognition
• Single note, professionally recorded or synthesized with high fidelity
• Simple, but includes most of the fundamental techniques
• Can be used to evaluate timbre features, because timbre is especially obvious when there is only one note
• Many evaluation sample collections exist, but they are still incomplete

Polyphonic recognition
• Overlapped sounds of different instruments played together: a duet, a trio, or an orchestral piece
• Difficult; more complex techniques such as pitch tracking and source separation are needed
• More practical and useful, since most music recordings are polyphonic
• No good sample collections; usually evaluated with a very small dataset

Recognition Systems Evaluation Criteria

• Accuracy: the system should be able to recognize different kinds of instruments with high accuracy.
• Generality: the recognition should not depend on a particular performer or a particular acoustic environment.
• Robustness: the system should ideally be able to handle real-world sounds with noise, reverberation, and competing sound sources.
• Scalability: the system should be able to accept a new sound source and learn to recognize it without decreasing the system performance. When new sound sources are continually introduced to the system, the performance should decrease only gradually.
• Realtime: the system should be able to recognize a source in real time.

Recognition Systems

Evaluation data collections

Monophonic collections
• Using one sample collection in the evaluation
• Using several sample collections in the evaluation
• Available collections: McGill University Master Samples (MUMS), University of Iowa Musical Instrument Samples, IRCAM Studio-On-Line Samples (IRCAM SOL), RWC Music Database

No good sample collections for polyphonic music
• Researchers have used their own data collections in the evaluation.

Use single data collection in the evaluation

System                    | Features | Accuracy  | Evaluation data
Kaminskyj & Voumard 96    | 7        | 98%       | 19 instruments: the instruments are very different and note range is small
Martin and Kim 98         | 31       | 70% / 90% | 1023 sounds of 14 instruments in McGill collection
Fujinaga 98               | 7        | 50.3%     | Over 1300 sounds of 39 timbres from 23 instruments in McGill collection
Fujinaga 99               | 20       | 64%       | (same collection)
Fujinaga 00               | 22       | 68%       | (same collection)
Eronen & Klapuri 00       | 43       | 80% / 94% | 1498 sounds of 30 instruments from McGill collection
Peeters & Rodet           | 81       | 86% / 89% | 1400 sounds of 14 instruments from IRCAM SOL
Peeters & Rodet           | 27       | 81% / 87% | (same collection)

Use multiple data collections in the evaluation

System                    | Features | Accuracy  | Evaluation data
Martin 99                 | 31       | 39% / 76% | 1500 sounds of 27 instruments from three sources: McGill, MIT Music Library’s compact disc collection, and recordings made especially for this project
Eronen 01                 | 38       | 35% / 77% | 5286 sounds of 29 instruments from five sources: McGill, Tampere guitar collection, UIowa, IRCAM SOL, and Roland XP-30 synthesizer
Livshin, Peeters & Rodet  | N/A      | N/A       | 1325 sounds of 16 instruments from five sources: IRCAM SOL, UIowa, McGill, Prosonus and Vitus collections

Feature Extraction

Feature Extraction

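[Figure: an audio clip (a violin note) is split into frames s1, s2, s3, ..., sn, ..., sM; the DFT of each frame gives the spectra S1, S2, S3, ..., Sn, ..., SM. Three groups of features are extracted: temporal features, spectral features, and cepstral features.]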

Temporal Features

Frame features
• Amplitude & Loudness: Root Mean Square

  RMS(n) = \sqrt{ \frac{1}{N} \sum_{i=0}^{N-1} S_n(i)^2 }

• Short time Energy

  STE(n) = \frac{1}{N} \sum_{i=0}^{N-1} [ S_n(i) \, w(N-1-i) ]^2

Clip features
• Combination of frame features, e.g., mean and standard deviation of RMS
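A minimal sketch of these temporal features follows. It assumes the usual definitions: RMS is the square root of the mean squared sample value of a frame, and short time energy is the mean of the squared samples after weighting by a window w(N-1-i). Non-overlapping frames, the rectangular window, and the toy signal are assumptions made for illustration.

```python
import math
import statistics


def split_frames(signal, n):
    """Split a signal into consecutive non-overlapping frames of n samples."""
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def rms(frame):
    """Root mean square of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def ste(frame, window):
    """Short time energy of one frame, weighted by window w(N-1-i)."""
    n = len(frame)
    return sum((frame[i] * window[n - 1 - i]) ** 2 for i in range(n)) / n


def clip_features(signal, n):
    """Clip-level features: mean and standard deviation of the frame RMS."""
    values = [rms(f) for f in split_frames(signal, n)]
    return statistics.mean(values), statistics.pstdev(values)


# Toy signal: a quiet frame followed by a louder frame.
signal = [1, -1, 1, -1, 2, -2, 2, -2]
print(clip_features(signal, 4))
print(ste(signal[:4], [1, 1, 1, 1]))
```

With a rectangular window the short time energy is simply the square of the RMS; other windows weight samples near one end of the frame more heavily.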


The shape of the spectral envelope is closely related to timbre

Spectral features can describe some of the spectral envelope

[Figure: spectral envelope of the spectrum Sn(ω), plotted against frequency ω]

Spectral Features

Spectral Moments

First Order Moment / Frequency Centroid
• Center frequency weighted by squared amplitude

  M(n) = \frac{ \sum_{i=0}^{N} i \, S_n(i)^2 }{ \sum_{i=0}^{N} S_n(i)^2 }

[Figure: spectrum Sn(ω) against frequency ω]
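The frequency centroid described above can be sketched as follows, assuming the input is a list of spectral amplitudes for one frame indexed by frequency bin. The result is expressed in bins rather than Hz, and the zero-energy guard is an added assumption.

```python
def spectral_centroid(amplitudes):
    """Bin index averaged with squared amplitude as the weight."""
    power = [a * a for a in amplitudes]
    total = sum(power)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    return sum(i * p for i, p in enumerate(power)) / total


print(spectral_centroid([0, 0, 1, 0]))   # all energy in bin 2
print(spectral_centroid([1, 1, 1, 1]))   # flat spectrum
```

To convert the result to Hz, multiply by the bin spacing (sample rate divided by the DFT length).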

Spectral Features

Spectral Centroid Moments
• Weighted average difference between spectral components and the frequency centroid
• Band-width: square root of the second order centroid moment
• Skewness: third order centroid moment

  C_k(n) = \frac{ \sum_{i=0}^{N} (i - M(n))^k \, S_n(i)^2 }{ \sum_{i=0}^{N} S_n(i)^2 }

[Figure: spectrum Sn(ω) against frequency ω]
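A short sketch of the centroid moments follows, assuming the k-th moment is the average of (bin index minus centroid) raised to the power k, weighted by squared amplitude. The input format (a list of spectral amplitudes per frame) and the function names are illustrative assumptions.

```python
import math


def centroid(amplitudes):
    """Frequency centroid in bins, weighted by squared amplitude."""
    power = [a * a for a in amplitudes]
    return sum(i * p for i, p in enumerate(power)) / sum(power)


def centroid_moment(amplitudes, k):
    """k-th order moment of the spectrum about its centroid."""
    m = centroid(amplitudes)
    power = [a * a for a in amplitudes]
    return sum((i - m) ** k * p for i, p in enumerate(power)) / sum(power)


def bandwidth(amplitudes):
    """Square root of the second order centroid moment."""
    return math.sqrt(centroid_moment(amplitudes, 2))


def skewness(amplitudes):
    """Third order centroid moment."""
    return centroid_moment(amplitudes, 3)


symmetric = [1, 0, 1]  # equal energy one bin either side of the centroid
print(bandwidth(symmetric), skewness(symmetric))
```

A spectrum that is symmetric about its centroid has zero skewness, while a single spectral line has zero bandwidth.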

Spectral Features

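Subband Energy and Subband Energy Ratio
• Analogous to frequency bands in human ears
• Represent the energy distribution of the spectrum

[Figure: spectrum Sn(ω) against frequency ω, divided into four subbands covering 1/8, 1/8, 1/4 and 1/2 of the frequency range]

  E_j(n) = \log\left( \sum_{i=L_j}^{H_j} S_n(i) \right)

and

  ER_j(n) = \frac{ E_j(n) }{ \sum_j E_j(n) }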
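The subband features can be sketched as below. The sketch assumes four subbands covering 1/8, 1/8, 1/4 and 1/2 of the spectrum (the fractions shown in the figure), takes the subband energy as the logarithm of the summed spectral values in each band, and takes the ratio as each energy divided by the sum over bands. The small `eps` guard against log(0) and the handling of a zero total are added assumptions.

```python
import math

BAND_FRACTIONS = (1 / 8, 1 / 8, 1 / 4, 1 / 2)


def band_edges(n_bins, fractions=BAND_FRACTIONS):
    """Return (low, high) bin ranges, high exclusive, for each subband."""
    edges, start, covered = [], 0, 0.0
    for fraction in fractions:
        covered += fraction
        end = round(covered * n_bins)
        edges.append((start, end))
        start = end
    return edges


def subband_energies(spectrum, fractions=BAND_FRACTIONS, eps=1e-12):
    """Log of the summed spectral values in each subband."""
    return [math.log(sum(spectrum[lo:hi]) + eps)
            for lo, hi in band_edges(len(spectrum), fractions)]


def subband_energy_ratios(energies):
    """Each subband energy divided by the total over all subbands."""
    total = sum(energies)
    if total == 0:
        return [0.0] * len(energies)
    return [e / total for e in energies]


flat = [1.0] * 16  # hypothetical flat spectrum with 16 bins
energies = subband_energies(flat)
print(band_edges(16))
print(subband_energy_ratios(energies))
```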

Spectral Features

Spectral Irregularity
• Represents the jaggedness of the spectral envelope

[Figure: spectrum Sn(ω) against frequency ω]

  I(n) = \frac{ \sum_{i=0}^{N-1} ( S_n(i) - S_n(i+1) )^2 }{ \sum_{i=0}^{N} S_n(i)^2 }
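A minimal sketch of spectral irregularity follows, assuming it is the sum of squared differences between neighbouring spectral amplitudes, normalised by the total squared amplitude. The exact summation limits are an assumption, as is the zero-energy guard.

```python
def spectral_irregularity(amplitudes):
    """Squared differences of adjacent bins over total squared amplitude."""
    total = sum(a * a for a in amplitudes)
    if total == 0:
        return 0.0  # silent frame
    diffs = sum((amplitudes[i] - amplitudes[i + 1]) ** 2
                for i in range(len(amplitudes) - 1))
    return diffs / total


print(spectral_irregularity([1, 1, 1, 1]))  # smooth envelope
print(spectral_irregularity([1, 0, 1, 0]))  # jagged envelope
```

A smooth envelope gives a value near zero, and a jagged one gives a larger value.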

Spectral Features

Formant Features
• Formant Frequency & Formant Amplitude: the position and amplitude of the first two formants are the most important
• Pitch: fundamental frequency
• Tristimulus: the percentage of the low-order formants compared to the higher ones

[Figure: spectrum Sn(ω) against frequency ω]

Cepstral Coefficients

Source-filter Model
• Source: periodic excitation of strings
• Filter: the resonator, body of an instrument
• The shape of the filter spectrum represents the spectral envelope
• How to extract the filter properties: Cepstral Coefficients

[Figure: source signal (white noise; excitation spectrum) → filter (filter spectrum) → output signal (signal spectrum)]


Mel-frequency Cepstral Coefficients

Signal s → Frame separating → Preprocessing → Windowing → DFT (spectrum) → Mel-scaling → Logarithm → DCT (cepstrum)

Mel Scaling
• The human auditory system perceives sound logarithmically

Discrete Cosine Transform (DCT)
• The DCT is taken to separate the filter and excitation properties.
• The low-order cepstrum is the compact representation of the filter
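The pipeline for a single frame can be sketched in plain Python as below. Several choices are assumptions rather than details from the slides: the Hamming window, the common mel formula 2595·log10(1 + f/700), triangular filters, 20 filters and 12 coefficients, and the omission of the preprocessing step. The naive DFT is only suitable for short frames.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Windowing + DFT: Hamming-window the frame, return |DFT|^2 for bins 0..N/2."""
    n = len(frame)
    x = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
         for i, s in enumerate(frame)]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Mel-scaling: triangular filters spaced evenly on the mel scale."""
    nyquist = sample_rate / 2
    top = hz_to_mel(nyquist)
    points = [mel_to_hz(top * j / (n_filters + 1)) / nyquist * (n_bins - 1)
              for j in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = points[j - 1], points[j], points[j + 1]
        bank.append([max(0.0, min((k - lo) / (mid - lo), (hi - k) / (hi - mid)))
                     for k in range(n_bins)])
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(spectrum), sample_rate)
    # Logarithm of each mel-band energy (small constant avoids log(0)).
    log_energy = [math.log(sum(w * p for w, p in zip(f, spectrum)) + 1e-10)
                  for f in bank]
    # DCT-II; coefficient 0 (overall level) is skipped, low orders are kept.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energy))
            for c in range(1, n_coeffs + 1)]


# One 256-sample frame of a 440 Hz sine sampled at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / rate) for t in range(256)]
coeffs = mfcc(frame, rate)
print(len(coeffs))
```

Keeping only the low-order coefficients retains the smooth spectral envelope (the filter) and discards the fine harmonic structure (the excitation).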

Feature Evaluation

Feature Evaluation –Sample Collections

Evaluated by sample collections
• Build a recognition system for each evaluated feature or feature set
• Use sample collections to calculate the system performance
• Evaluate features by the system performance (accuracy)

Advantages
• Easy to carry out
• There are free sample collections
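The evaluation procedure can be illustrated with a small sketch that scores a feature by the accuracy of a recognizer built on it. The recognizer here is a leave-one-out nearest-neighbour classifier, which is an assumption chosen for brevity, and the labelled feature values are made up for illustration.

```python
import math


def leave_one_out_accuracy(samples):
    """samples: list of (label, feature_vector).
    Classify each sample by its nearest neighbour among the others."""
    correct = 0
    for i, (label, vec) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        nearest = min(others, key=lambda s: math.dist(s[1], vec))
        if nearest[0] == label:
            correct += 1
    return correct / len(samples)


# Hypothetical one-dimensional features for the same four sounds.
useful_feature = [("violin", [1.0]), ("violin", [1.1]),
                  ("flute", [5.0]), ("flute", [5.2])]
useless_feature = [("violin", [1.0]), ("flute", [1.1]),
                   ("violin", [5.0]), ("flute", [5.2])]

print(leave_one_out_accuracy(useful_feature))
print(leave_one_out_accuracy(useless_feature))
```

A feature that separates the instruments yields high accuracy, and one that does not yields low accuracy, which is the basis for ranking features this way.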

Feature Evaluation –Sample Collections

Disadvantages

Diversity of the music
• Some sample collections don’t have the same properties as other sample collections

Using too few sample collections decreases the generality of the system
• The accuracy of these systems is skewed.
• These systems do not satisfy the generality criterion.

Since we do not know how many sample collections are needed, it may not be reliable to use incomplete sample collections to:
• Evaluate a recognition system
• Evaluate the effectiveness of a feature
