Social and Commercial Implication
• Enormous amount of recorded music throughout the world
• Over 10,000 new CDs are released every year
• The search for music, specifically in the MP3 format, is the most popular retrieval request
• In the U.S. alone, 1.08 billion units of recorded music (e.g., CDs, cassettes, music videos), valued at $14.3 billion, were shipped to retailers in the year 2000
Traditional Music IR mechanism
Text-based examples
• Peer-to-peer file sharing software
• FTP
• Streaming audio
• Websites
• Online network drives
[Figure: clip art labeled "mp3" and "text"]
Content-based MIR
Content-Based MIR systems
• Based on high level music content: timbre, melody, rhythm, genre, mood
• Content-based recognition techniques: Feature extraction and Pattern recognition
Traditional MIR systems
• Based on symbolic representation: title, lyrics, composer, performer, singer
• Use similar techniques as text retrieval
Music Information Retrieval (MIR)
Disadvantages of traditional MIR systems:
• The textual information is not enough to represent the content of the music
• There are enormous amounts of music without any textual information
Scenario 1
Chopin’s Scherzo – a delightful piano melody
Previous preferences
MIR System
query
Music on Internet
Music in databases
Timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition, mood recognition
Summaries of ten best candidates
A series of other processes
......
until
Place an order for the final selection and get a full piece
Scenario 2
• Listen to the audio program “Music of the World”
• Cut the jazz pieces from the audio stream
• Searching: jazz bibliography
• Maintain a categorical archive
[Figure: categorical archive. Music: Jazz, Classical, Pop, R&B, Rock. Composers shown: Bach, Beethoven, Mozart]
Content-based MIR
Advantages
• Quick and easy searching
• Fewer problems with text labeling
Disadvantages
• Difficulties in sound recognition
Philips Audio Fingerprinting
Audio Fingerprinting
• Microphone device captures your voice
• “Audio fingerprints” are determined
• Melody query is then sent to a database
• Results are returned
• Good for finding a song you don’t know by name but know by tune
Jaap Haitsma and Ton Kalker. “A Highly Robust Audio Fingerprinting System”. Proceedings of ISMIR 2002, Paris, France, October 2002.
Philips Audio Fingerprinting Applications
• Music recognition over mobile phones
• Building broadcast monitoring systems for copyright verification, commercial verification, and royalty metering
• Maintaining personal music archives
• Allows the creation of music-aware networks, giving reliable control over the flow of copyrighted music
• An essential tool for building more secure digital audio watermarks
Philips Audio Fingerprinting
Performance
• Efficient: typically 3 seconds of music is enough to identify the music
• Fast
• Robust: not affected by compression or environmental noise
• Highly accurate: can discriminate between different versions of a song, even if performed by the same artist
Philips Audio Fingerprinting
Main Challenges
• Unique features: these features are loosely analogous to normal fingerprints for human beings
• Clever searching: algorithms to compare these audio fingerprints to large databases of previously extracted audio fingerprints
Other Challenges
• Must be able to work with short segments of music
• The music that is offered to the fingerprint extractor is of poor quality
• The easiest way to lose customers is to provide poor performance
• The cell-phone recognition service must make economic sense
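To make the "compare fingerprints against a database" step concrete, here is a minimal sketch. It assumes fingerprints are stored as equal-length lists of 32-bit integers and are compared by bit error rate (the fraction of differing bits). The function names, the toy database, and the 0.35 threshold are illustrative assumptions, and the linear scan is a brute-force stand-in, not the clever search the slide refers to.

```python
def bit_error_rate(fp_a, fp_b, bits_per_word=32):
    """Fraction of differing bits between two equal-length fingerprints."""
    if not fp_a or len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must be non-empty and of equal length")
    # XOR leaves a 1 wherever the two words disagree; count those bits.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return differing / (len(fp_a) * bits_per_word)


def best_match(query, database, threshold=0.35):
    """Brute-force search: return (title, ber) of the closest fingerprint.

    The title is None when even the closest entry exceeds the threshold.
    The threshold value is an assumption chosen for illustration.
    """
    best_title, best_ber = None, float("inf")
    for title, fingerprint in database.items():
        ber = bit_error_rate(query, fingerprint)
        if ber < best_ber:
            best_title, best_ber = title, ber
    if best_ber > threshold:
        return None, best_ber
    return best_title, best_ber
```

For example, with a hypothetical database `{"song_a": [0xFFFFFFFF, 0x00000000]}`, the noisy query `[0xFFFFFFFE, 0x00000001]` differs in 2 of 64 bits and is still matched to `song_a`.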
http://www.research.philips.com/InformationCenter/Global/FArticleSummary.asp?lNodeId=927
http://www.soundfisher.com
Keislar, D., T. Blum, J. Wheaton, and E. Wold. “A content-aware sound browser”. Proc. of the International Computer Music Conference, ICMA, 1999.
SoundFisher (Muscle Fish)
SoundFisher
Supports Traditional Text and Numeric Queries
• Performs text searches on keywords, comments, composer or performer names, and user-defined fields.
• Searches and sorts by file attributes, such as sample rate, number of channels, etc.
Powerful "Sounds-like" Queries
• Finds sounds by example using selected sounds or user-defined sound categories.
• Searches and sorts sounds by similarity, based on pitch, loudness, brightness and/or overall timbre.
• However, SoundFisher is not intended as a search mechanism for music catalogs. In other words, it does not address sound at the level of the musical phrase, melody, rhythm or tempo.
Permits Flexible Categories
• Supports sound categories that contain nested categories.
• No need to conform to predefined categories.
IRCAM Studio-On-Line
A content-based search & classification interface
The "Sound Palette" provides access to an instrumental sound database, as well as a Web site on Instruments and their playing modes.
IRCAM Studio-On-Line
• Offers a primitive search-by-perceptual-similarity function: search available sounds through high-level criteria (sound categories, dynamic profiles, timbre similarity, etc.)
• Offers manual and automatic sound labeling and classification: learns classification criteria from user-provided sample training sets, and then performs automatic classification of newly entered samples among the learned classes
• Access to audio material in various formats: mp3 streaming, audio files, and compressed archives download
Music Recognition
• Timbre recognition
• Singer recognition
• Melody recognition
• Rhythm recognition
• Genre recognition
• Mood recognition
Challenges:
• Multirepresentational Challenge
• Multicultural Challenge
• Multiexperiential Challenge
• Audio Fingerprinting: unique features, clever searching
• SoundFisher: "Sounds-like" queries and categories
• Studio-On-Line: search-by-perceptual-similarity
Music Recognition
Timbre recognition
• Find solo violin pieces
• Which works use the following combination of instruments?
Singer recognition
• Find songs by Sting
• Who sings the song I just played?
• I want to find some songs for karaoke, where the singer’s voice is similar to my own
Music Recognition
Melody recognition
• Find songs with a similar melody to what I am listening to right now
Rhythm recognition
• Find songs with a dance rhythm
Genre recognition
• Find songs whose style is similar to the one I am listening to now
Mood recognition
• Find a soft song to calm myself after a hard day at UST (Univ. of Stress & Tension)
Music Recognition
Automatic Music Transcription Systems
• Transcribe sound files to high-level musical notation (MIDI files or sheet scores)
Music Annotation and Indexing
• Segment audio streams and assign symbols to indicate the content of the segment.
MPEG-7 descriptors
• The MPEG description consists of semantic descriptors (e.g., type of music), and perceptual features describing the audio content.
Video Indexing and Retrieval
• Audio is an important component in video indexing and retrieval.
The four basic perceptual attributes of sound
• Pitch / Fundamental frequency
• Loudness / Amplitude
• Duration
• Timbre
Definition: Timbre is the quality of a sound by which a listener can judge that two sounds of the same loudness and pitch are dissimilar.
Instrument Recognition
• Recognize instrument families and individual instruments
Timbre
Five classes of instruments
• The strings: instruments with strings that are played by touching them with a bow or plectrum (violin, viola, cello, double bass, guitar)
• The brass: wind instruments made of brass (trumpet, trombone, French horn, tuba)
• The woodwinds: wind instruments often made of wood, flutes and reeds (double reeds, clarinets, saxophones, flutes, piccolo, bassoon)
• The percussion: timpani, marimba, drums, cymbal, gong
• The keyboard: harpsichord, piano, organ
Timbre Recognition
• Instrument families: strings, brass, woodwinds, percussion, keyboard
• A taxonomic hierarchy
[Figure: taxonomic hierarchy of instruments]
All Instruments
  Released
    Piano: Piano
    Pizzicato strings: Guitar, Violin, Viola, Cello, Double Bass
  Sustained
    Bowed strings: Violin, Viola, Cello, Double Bass
    Flute or Piccolo: Flute, Alto flute, Bass flute, Piccolo
    Brass or Reeds
      Reeds: Oboe, English horn, Bassoon, Clarinet, Saxophone
      Brass: Trumpet, French horn, Trombone, Tuba
Human Recognition System
[Figure: McAdams’s model of human auditory processing: sensory transduction → auditory grouping → analysis of features → matching with lexicon → meaning & significance, lexicon of names → recognition]
Little is known about the human sound source recognition system
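Recognition Systems
[Figure: block diagram of a recognition system]
• Training: S(n) → Pre-processing → Feature Extractor → Model Training → Instrument model
• Classification: S(n) → Pre-processing → Feature Extractor → Classifier
• Multi-level representation of timbre: temporal features, cepstral features, spectral features, other features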
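The training and classification paths can be illustrated with a very small sketch. The slides do not name a classifier, so this assumes a nearest-mean classifier: training averages the feature vectors of each instrument into an "instrument model", and classification picks the model closest to a new feature vector. The function names and the two-dimensional toy features are hypothetical.

```python
import math


def train(examples):
    """Build one mean feature vector per label.

    examples: list of (feature_vector, label) pairs, all vectors the same length.
    Returns a dict mapping label -> mean feature vector (the "instrument model").
    """
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(models, features):
    """Return the label whose mean vector is nearest in Euclidean distance."""
    return min(models, key=lambda label: math.dist(models[label], features))
```

With toy training data such as `[([0.2, 0.9], "violin"), ([0.8, 0.1], "flute")]`, a new vector `[0.25, 0.8]` would be labeled `"violin"`. Real systems would use the temporal, spectral, and cepstral features described later and usually a stronger classifier.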
Recognition Systems
Monophonic recognition
• Single note, professionally recorded or synthesized with high fidelity
• Simple, but includes most of the fundamental techniques
• Can be used to evaluate timbre features, because timbre is especially obvious when there is only one note
• Many evaluation sample collections exist, but they are still incomplete
Polyphonic recognition
• Overlapped sounds of different instruments played together: a duet, a trio, or an orchestral piece
• Difficult; more complex techniques such as pitch tracking and source separation are needed
• More practical and useful, since most music recordings are polyphonic
• No good sample collections; usually evaluated with a very small dataset
Recognition Systems Evaluation Criteria
• Accuracy: the system should be able to recognize different kinds of instruments with high accuracy.
• Generality: the recognition should not depend on a particular performer and the particular acoustic environment.
• Robustness: the system should ideally be able to handle real world sounds with noise, reverberation, and competing sound sources.
• Scalability: the system should be able to accept a new sound source and learn to recognize it without decreasing the system performance. When new sound sources are continually introduced to the system, the performance should decrease gradually.
• Realtime: the system should be able to recognize a source in realtime.
Recognition Systems
Evaluation data collections
Monophonic collections
• Using one sample collection in the evaluation
• Using several sample collections in the evaluation
• Collections: McGill University Master Samples (MUMS), University of Iowa Musical Instrument Samples, IRCAM Studio-On-Line Samples (IRCAM SOL), RWC Music Database
Polyphonic collections
• No good sample collections for polyphonic music
• Researchers have used their own data collections in the evaluation.
Use single data collection in the evaluation
Study | Features | Accuracy | Evaluation data
Kaminskyj & Voumard 96 | 7 | 98% | 19 instruments: the instruments are very different and note range is small
Martin and Kim 98 | 31 | 70% / 90% | 1023 sounds of 14 instruments in McGill collection
Fujinaga 98 | 7 | 50.3% | Over 1300 sounds of 39 timbres from 23 instruments in McGill collection
Fujinaga 99 | 20 | 64% | (same data as Fujinaga 98)
Fujinaga 00 | 22 | 68% | (same data as Fujinaga 98)
Eronen & Klapuri 00 | 43 | 80% / 94% | 1498 sounds of 30 instruments from McGill collection
Peeters & Rodet | 81 | 86% / 89% | 1400 sounds of 14 instruments from IRCAM SOL
Peeters & Rodet | 27 | 81% / 87% | (same data as the row above)
Use multiple data collections in the evaluation
Study | Features | Accuracy | Evaluation data
Martin 99 | 31 | 39% / 76% | 1500 sounds of 27 instruments from three sources: McGill, MIT Music Library’s compact disc collection, and recordings made especially for this project
Eronen 01 | 38 | 35% / 77% | 5286 sounds of 29 instruments from five sources: McGill, Tampere guitar collection, UIowa, IRCAM SOL, and Roland XP-30 synthesizer
Livshin, Peeters & Rodet | N/A | N/A | 1325 sounds of 16 instruments from five sources: IRCAM SOL, UIowa, McGill, Prosonus and Vitus collections
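Feature Extraction
[Figure: an audio clip (a violin note) is split into frames s1, s2, s3, ..., sn, ..., sM; the DFT maps the frames to spectra S1, S2, S3, ..., Sn, ..., SM. Feature groups: temporal features, spectral features, cepstral features]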
Temporal Features
Frame features: amplitude & loudness
• Root Mean Square
$$\mathrm{RMS}(n) = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} S_n^2(i)}$$
• Short time Energy
$$\mathrm{STE}(n) = \frac{1}{N} \sum_{i=0}^{N-1} \left[ S_n(i)\, w(N-1-i) \right]^2$$
Clip features
• Combination of frame features
• Mean and standard deviation of RMS
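A minimal sketch of the frame and clip features, assuming the clip is a plain list of samples and that w is an analysis window of the same length as a frame (rectangular by default here, which is an assumption). The function names, frame length, and hop size are illustrative.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into frames of frame_len, advancing by hop."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def rms(frame):
    """Root mean square of one frame: sqrt((1/N) * sum of squared samples)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def short_time_energy(frame, window=None):
    """(1/N) * sum over i of [frame[i] * window[N-1-i]]^2."""
    n = len(frame)
    if window is None:
        window = [1.0] * n  # rectangular window (assumption)
    return sum((frame[i] * window[n - 1 - i]) ** 2 for i in range(n)) / n


def clip_features(samples, frame_len=4, hop=4):
    """Clip-level features: mean and (population) standard deviation of frame RMS."""
    values = [rms(f) for f in frame_signal(samples, frame_len, hop)]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)
```

For instance, `clip_features([1, -1, 1, -1, 3, -3, 3, -3])` splits the clip into two frames with RMS 1 and 3, so the clip features are a mean of 2 and a standard deviation of 1.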
The shape of the spectral envelope is closely related to timbre
Spectral features can describe some of the spectral envelope
Spectral Envelope
[Figure: spectral envelope of Sn(ω) plotted against frequency ω]
Spectral Features
Spectral Moments
• First Order Moment / Frequency Centroid: center frequency weighted by squared amplitude
$$M_k(n) = \frac{\sum_{i=0}^{N} i^k \left| S_n(i) \right|^2}{\sum_{i=0}^{N} \left| S_n(i) \right|^2}$$
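As a rough illustration of the spectral moment, the sketch below computes a magnitude spectrum with a naive DFT (adequate for short frames) and then the k-th moment in units of bin index. The function names are illustrative, and returning 0 for a silent frame is a convention assumed here.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectral_moment(spectrum, k=1):
    """M_k = sum(i^k * |S(i)|^2) / sum(|S(i)|^2); k=1 gives the centroid."""
    power = [s * s for s in spectrum]
    total = sum(power)
    if total == 0.0:
        return 0.0  # silent frame: moment is undefined, 0 by convention here
    return sum((i ** k) * p for i, p in enumerate(power)) / total
```

A sine wave with exactly 4 cycles in a 32-sample frame concentrates its energy in bin 4, so its centroid comes out at about 4. To express the centroid in Hz, multiply the bin index by the sample rate divided by the frame length.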
Spectral Features
Spectral Centroid Moments
• Weighted average difference between spectral components and frequency centroid
• Bandwidth: square root of the second order centroid moment
• Skewness: third order centroid moment
$$C_k(n) = \frac{\sum_{i=0}^{N} \left( i - M_1(n) \right)^k \left| S_n(i) \right|^2}{\sum_{i=0}^{N} \left| S_n(i) \right|^2}$$
where $M_1(n)$ is the frequency centroid of frame $n$.
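The centroid moments can be sketched as follows, taking a list of squared spectral magnitudes (a power spectrum) as input. The function names are illustrative, and the code assumes a non-silent frame so that the total power is not zero.

```python
import math


def centroid(power):
    """First-order moment: power-weighted mean bin index."""
    return sum(i * p for i, p in enumerate(power)) / sum(power)


def centroid_moment(power, k):
    """C_k = sum((i - M_1)^k * P(i)) / sum(P(i)), with P the squared magnitudes."""
    m1 = centroid(power)
    return sum(((i - m1) ** k) * p for i, p in enumerate(power)) / sum(power)


def bandwidth(power):
    """Square root of the second-order centroid moment."""
    return math.sqrt(centroid_moment(power, 2))


def skewness(power):
    """Third-order centroid moment (not normalized by bandwidth)."""
    return centroid_moment(power, 3)
```

For the symmetric spectrum `[0, 1, 0, 1, 0]` the centroid is bin 2, the bandwidth is 1, and the skewness is 0; a spectrum with a tail toward high bins, such as `[3, 0, 0, 1]`, gives a positive skewness.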
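Spectral Features
Subband Energy and Subband Energy Ratio
• Analogous to frequency bands in human ears
• Represent the energy distribution of the spectrum
[Figure: spectrum Sn(ω) over frequency ω, divided into four subbands labeled 1/8, 1/8, 1/4, 1/2]
$$E_j(n) = \log\left( \sum_{i=L_j}^{H_j} S_n(i) \right) \quad \text{and} \quad ER_j(n) = \frac{E_j(n)}{\sum_j E_j(n)}$$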
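A small sketch of the subband features, assuming four subbands whose widths are 1/8, 1/8, 1/4, and 1/2 of the spectrum, as the figure labels suggest. Whether the input holds magnitudes or squared magnitudes is left to the caller; the `eps` term that guards against log(0) and the function names are assumptions.

```python
import math

# Band edges as fractions of the spectrum length, following the
# 1/8, 1/8, 1/4, 1/2 split (an interpretation of the figure labels).
BAND_EDGES = [0.0, 1 / 8, 1 / 4, 1 / 2, 1.0]


def subband_energies(spectrum, eps=1e-12):
    """E_j = log(sum of spectrum values in band j); eps avoids log(0)."""
    n = len(spectrum)
    energies = []
    for low, high in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        band = spectrum[int(low * n):int(high * n)]
        energies.append(math.log(sum(band) + eps))
    return energies


def subband_energy_ratios(energies):
    """ER_j = E_j / sum over j of E_j."""
    total = sum(energies)
    return [e / total for e in energies]
```

For a flat 8-bin spectrum of ones, the band sums are 1, 1, 2, and 4, so the log energies are roughly 0, 0, ln 2, and ln 4, and the ratios are roughly 0, 0, 1/3, and 2/3. Because the energies are logarithms, they can be negative or sum to zero for quiet frames, so the ratio needs care on real data.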
Spectral Features
Spectral Irregularity
• Represents the jaggedness of the spectral envelope
$$I(n) = \frac{\sum_{i=0}^{N-2} \left( S_n(i) - S_n(i+1) \right)^2}{\sum_{i=0}^{N-1} S_n^2(i)}$$
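The irregularity measure is short enough to sketch directly: the sum of squared differences between neighboring spectral values, divided by the total squared amplitude. The function name is illustrative, and returning 0 for an all-zero spectrum is a convention assumed here.

```python
def spectral_irregularity(spectrum):
    """Sum of squared neighbor differences over the sum of squared values."""
    numerator = sum((spectrum[i] - spectrum[i + 1]) ** 2
                    for i in range(len(spectrum) - 1))
    denominator = sum(s * s for s in spectrum)
    if denominator == 0:
        return 0.0  # all-zero spectrum: undefined, 0 by convention here
    return numerator / denominator
```

A flat spectrum such as `[1, 1, 1, 1]` gives 0, while the jagged spectrum `[1, 0, 1, 0]` gives 1.5, which matches the idea that the feature grows with the jaggedness of the envelope.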
Spectral Features
Formant Features
• Formant Frequency & Formant Amplitude: the position and amplitude of the first two formants are the most important
• Pitch: fundamental frequency
• Tristimulus: the percentage of the low-order formants compared to the higher ones
Cepstral Coefficients
Source-filter Model
• Source: periodic excitation of strings
• Filter: the resonator, body of an instrument
• The shape of the filter spectrum represents the spectral envelope
• How to extract the filter properties: Cepstral Coefficients
[Figure: source signal (white noise) → filter → output signal, shown with the excitation spectrum, filter spectrum, and signal spectrum]
Mel-frequency Cepstral Coefficients
[Figure: processing chain. Signal s → frame separating → preprocessing → windowing → DFT (spectrum) → Mel-scaling → logarithm → DCT (cepstrum)]
Mel Scaling
• Human auditory system perceives sound logarithmically
Discrete Cosine Transform (DCT)
• DCT is taken to separate the filter and excitation properties.
• The low-order cepstrum is the compact representation of the filter
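The chain for a single frame (windowing, DFT, Mel-scaling, logarithm, DCT) can be sketched in plain Python. Several choices here are assumptions, not taken from the slides: the Hamming window, the common mel formula 2595·log10(1 + f/700), triangular filters, the filter and coefficient counts, and all function names. The preprocessing step is omitted, and the naive DFT is only suitable for short frames.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame, then take squared naive-DFT magnitudes."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(top * j / (num_filters + 1)) for j in range(num_filters + 2)]
    edges = [f / nyquist * (num_bins - 1) for f in edges_hz]  # fractional bin index
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(num_bins):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(n)]


def mfcc(frame, sample_rate, num_filters=12, num_coeffs=8, eps=1e-10):
    """Low-order cepstral coefficients of one frame; eps avoids log(0)."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, spectrum)) + eps)
                    for row in bank]
    return dct2(log_energies)[:num_coeffs]
```

Calling `mfcc(frame, 8000)` on a 64-sample frame returns 8 coefficients. Keeping only the low-order coefficients is what retains the smooth filter (spectral envelope) part and discards most of the excitation detail.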
Feature Evaluation: Sample Collections
Evaluated by sample collections
• Build a recognition system for each evaluated feature or feature set
• Use sample collections to calculate the system performance
• Evaluate features by the system performance (accuracy)
Advantages
• Easy to carry out
• There are free sample collections
Feature Evaluation: Sample Collections
Disadvantages
• Diversity of the music: some sample collections don’t have the same properties as other sample collections
• Using too few sample collections decreases the generality of the system, and the accuracy of these systems is skewed
• These systems do not satisfy the generality criterion
• Since we do not know how many sample collections are needed, it may not be reliable to use incomplete sample collections to evaluate a recognition system or to evaluate the effectiveness of a feature