Combining Audio Content and Social Context for Semantic Music Discovery

José Carlos Delgado Ramos

Universidad Católica San Pablo

I. Introduction

II. Sources of Music Information

III. Combining multiple sources of music information

IV. Experiments

Introduction

• Most music IR systems focus on either content-based analysis of audio signals…

Introduction

• Or content-based analysis of webpages…

Introduction

• …user preference information…

Introduction

• … and social tagging data.

Tags

• Short text-based tokens

• Helpful when describing songs

Tags

• Not always accurate: the strength of the semantic association between each song and each tag may vary.

Sources of semantic information

• Surveys

• Social tagging websites

• Annotation games

Relevance of tags to songs

• May be determined by using content-based audio analysis or by text-mining associated web documents.

Main sources for information retrieval

• Audio content, Social tags and Web documents

• Audio signal analysis uses two acoustic feature representations, related to timbre and harmony.

Sources of Music Information

• A relevance score function r(s;t) is derived; it evaluates the relevance of a song s to a tag t.

• Song-tag representations are dense if based on audio content, sparse if based on social representations.

Representing Audio Content: Supervised Multiclass Labeling (SML)

• Audio track s represented as a bag of feature vectors X = {x1,x2,…,xT}

• 1: Expectation-maximization (EM) algorithm.

• 2: Identify the set of example songs with a given tag.

• 3: Mixture-hierarchies expectation-maximization algorithm.

Representing Audio Content: Supervised Multiclass Labeling (SML)

• Given a song s, X is extracted and its likelihood is evaluated using each of the tag GMMs.

• Result: vector of probabilities. Relevance of song s to a tag t may be written as:
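
The scoring step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes diagonal-covariance tag GMMs that have already been trained, a uniform prior over tags, and the average per-frame log-likelihood as the song-level likelihood. The names `gmm_log_likelihood`, `tag_relevance` and the toy tags are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM.

    X: (T, D) bag of feature vectors; weights: (K,); means, variances: (K, D).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_mix = np.log(weights) + log_comp                          # (T, K)
    # log-sum-exp over mixture components, done stably
    m = log_mix.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return frame_ll.mean()

def tag_relevance(X, tag_gmms):
    """Normalize per-tag likelihoods into a probability vector over tags.

    Assumes a uniform tag prior (an assumption of this sketch).
    """
    tags = list(tag_gmms)
    ll = np.array([gmm_log_likelihood(X, *tag_gmms[t]) for t in tags])
    p = np.exp(ll - ll.max())
    p /= p.sum()
    return dict(zip(tags, p))

# Toy usage: two single-component tag models in a 2-D feature space.
toy_gmms = {
    "mellow": (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2))),
    "loud": (np.array([1.0]), np.full((1, 2), 5.0), np.ones((1, 2))),
}
song = np.array([[0.1, -0.2], [0.0, 0.3]])
relevance = tag_relevance(song, toy_gmms)
```

Because the audio model scores every song against every tag, the resulting song-tag representation is dense.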

Representing Audio Content: Audio feature representations

• Mel Frequency Cepstral Coefficients (MFCC): associated with musical notion of timbre.

• Chroma: represents the harmonic content (keys, chords) by computing spectral energy at frequencies corresponding to the chromatic scale.
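
As a rough illustration of the chroma idea, the sketch below folds the power spectrum of one audio frame into 12 pitch classes. It is a simplified stand-in, not the feature extractor used in the work: the function name `chroma_frame`, the Hann window, the frequency range, and the 440 Hz tuning reference are all assumptions.

```python
import numpy as np

def chroma_frame(frame, sample_rate, fmin=55.0, fmax=4000.0):
    """Fold the power spectrum of one frame into 12 pitch classes (0 = C)."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    keep = (freqs >= fmin) & (freqs <= fmax)      # also drops the 0 Hz bin
    # Map each bin to the nearest MIDI note (A4 = 440 Hz = MIDI 69).
    midi = 69 + 12 * np.log2(freqs[keep] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    chroma = np.zeros(12)
    np.add.at(chroma, pitch_class, spectrum[keep])  # accumulate energy per class
    total = chroma.sum()
    return chroma / total if total > 0 else chroma

# Toy usage: a pure 440 Hz tone should concentrate energy in pitch class A.
sr = 22050
t = np.arange(4096) / sr
chroma = chroma_frame(np.sin(2 * np.pi * 440.0 * t), sr)
```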

Representing Social Context:

• Summarize each song with annotation vector over a vocabulary of tags.

• Methods for retrieving tags: social and web-mined.

• Missing song-tag pair: the tag is either not relevant, or relevant but not annotated.

Representing Social Context: Social Tags

• Last.FM: music discovery website.

• 20 million users a month annotate 3.8 million items over 50 million times, using a vocabulary of 1.2 million tags.

• Last.FM database: 150 million songs, 16 million artists.



• Two lists of social Last.FM tags for each song: one relating the song to tags, and one relating the artist to tags.

• Relevance Tsocial(s,t) = tag scores on the artist list + tag scores on the song list + tag scores for synonyms or wildcard matches of t on either list.
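
The sum above might be sketched as follows. This is a simplified illustration under stated assumptions: tag lists are plain dictionaries mapping tag to score, synonyms are passed in explicitly, and wildcard matching is left out. The function name `social_relevance` and the example scores are hypothetical.

```python
def social_relevance(tag, song_tags, artist_tags, synonyms=()):
    """Sum the scores of a tag (and its synonyms) on the song and artist lists.

    A tag that appears on neither list contributes 0, which is why the
    social representation is sparse.
    """
    candidates = {tag, *synonyms}
    return sum(song_tags.get(c, 0) + artist_tags.get(c, 0) for c in candidates)

# Toy usage with made-up scores.
song_tags = {"rock": 80, "classic rock": 20}
artist_tags = {"rock": 100}
score = social_relevance("rock", song_tags, artist_tags, synonyms=["classic rock"])
missing = social_relevance("jazz", song_tags, artist_tags)
```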

Representing Social Context: Web-Mined Tags

• Relevance Scoring (RS) algorithm.

• Relevance is a function of tag frequency, document frequency, total number of words in the documents, etc.

• Site-specific queries on high-quality websites.

• Steps: collect the document corpus, then tag songs.

Combining multiple sources of music information

• Given a query tag t, the goal is to find a single rank ordering of songs based on their relevance to t.

• Tag-score, web-relevance score and convex optimization used.

• Three algorithms, all supervised: they use labeled training data for learning.

Calibrated Score Averaging (CSA)

• Using training data, we can learn a function g() that calibrates scores such that

• To learn g(), we start with a rank-ordered training set of N songs where

• If the data is perfectly ordered, then g is isotonic. Otherwise:

Calibrated Score Averaging (CSA)

• E.g. 7 songs with relevance scores (1,2,4,5,6,7,9) and ground truth labels (0,1,0,1,1,0,1).

• Then g(r) = 0 for r < 2, g(r) = ½ for 2 <= r < 5, g(r) = 2/3 for 5 <= r < 9 and g(r) = 1 for 9 <= r.
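
The 7-song example can be checked with a small isotonic-regression sketch. It uses the pool-adjacent-violators approach, a standard way to fit a non-decreasing function by least squares; whether the authors used exactly this routine is an assumption, and the names `pava` and `calibrate` are hypothetical.

```python
from bisect import bisect_right
from fractions import Fraction

def pava(labels):
    """Pool adjacent violators: least-squares non-decreasing fit to labels."""
    blocks = []  # each block is [sum of labels, count]
    for y in labels:
        blocks.append([Fraction(y), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def calibrate(r, scores, fitted):
    """Step function g(r): value fitted at the largest training score <= r."""
    i = bisect_right(scores, r) - 1
    return fitted[max(i, 0)]

scores = [1, 2, 4, 5, 6, 7, 9]          # rank-ordered training scores
labels = [0, 1, 0, 1, 1, 0, 1]          # ground truth relevance
fitted = pava(labels)                   # 0, 1/2, 1/2, 2/3, 2/3, 2/3, 1
```

Exact fractions are used so the calibrated values 0, ½, 2/3 and 1 come out without floating-point noise.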

• A missing song-tag score suggests the tag isn’t relevant. Instead:

RankBoost algorithm

• For a given song, the weak ranking function is an indicator function that outputs 1 if the score for the associated representation is greater than the threshold, or if the score is missing and the default value is set to 1. Otherwise it outputs 0.
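
A weak ranking function of this kind might look like the sketch below. The boosting loop that selects thresholds, defaults, and weights is not shown; the weights `alpha`, the tuple layout of `rankers`, and the function names are assumptions made for illustration.

```python
def weak_ranker(score, threshold, default):
    """Indicator weak ranker for one representation of one song.

    score: the song's score under that representation, or None if missing.
    default: 0 or 1, the value used when the score is missing.
    """
    if score is None:
        return default
    return 1 if score > threshold else 0

def combined_rank_score(song_scores, rankers):
    """Weighted sum of weak rankers (weights alpha are hypothetical here).

    rankers: iterable of (source, threshold, default, alpha) tuples.
    """
    return sum(alpha * weak_ranker(song_scores.get(source), threshold, default)
               for source, threshold, default, alpha in rankers)

# Toy usage: one ranker on the audio score, one on the social score.
rankers = [("audio", 0.5, 0, 1.0), ("social", 10, 1, 0.5)]
high = combined_rank_score({"audio": 0.7}, rankers)              # social score missing
low = combined_rank_score({"audio": 0.2, "social": 3}, rankers)
```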

Kernel Combination SVM (KC-SVM)

• Linear combination of M different kernels that each encode different data features:

• Since each kernel matrix Km is positive semi-definite, their positive-weighted sum K is also a valid positive semi-definite kernel.
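
The closure property can be illustrated numerically. In this sketch the kernel matrices are random Gram matrices rather than real song kernels, the weights are fixed by hand rather than learned, and `combine_kernels` is a hypothetical helper name.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Non-negative weighted sum of kernel matrices, weights scaled to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

# Toy usage: two Gram matrices (PSD by construction) over 5 "songs".
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(5, 2))
K = combine_kernels([A @ A.T, B @ B.T], [2.0, 1.0])
min_eigenvalue = np.linalg.eigvalsh(K).min()  # non-negative up to rounding
```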


• Km represents similarities between all songs in the data set. For the audio features, the vectors X = {x1,x2,…,xT} are obtained from MFCC and Chroma and used to compute the entries of a probability product kernel (PPK).


• For each of the social context features, a radial basis function (RBF) kernel is computed, with entries:

• Where K(i,j) represents the similarity between xi and xj, the annotation vectors for songs i and j.
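
A common form of the RBF kernel is K(i,j) = exp(-||xi - xj||² / (2σ²)); the sketch below uses that form. The bandwidth parameterization and the value of `sigma` are assumptions, as are the function name and the toy annotation vectors.

```python
import numpy as np

def rbf_kernel(A, sigma=1.0):
    """RBF kernel matrix over the rows of A (one annotation vector per song)."""
    A = np.asarray(A, dtype=float)
    sq = np.sum(A ** 2, axis=1)
    # Squared Euclidean distances between all pairs of rows.
    d2 = sq[:, None] + sq[None, :] - 2 * (A @ A.T)
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-d2 / (2 * sigma ** 2))

# Toy usage: three songs annotated over a four-tag vocabulary.
annotations = np.array([[1.0, 0.0, 0.5, 0.0],
                        [0.9, 0.1, 0.4, 0.0],
                        [0.0, 1.0, 0.0, 0.8]])
K_social = rbf_kernel(annotations, sigma=1.0)
```

Songs 0 and 1 have similar annotation vectors, so their kernel entry is larger than the entry between songs 0 and 2.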


• For each tag t and corresponding class-label vector y, the primal problem for a single-kernel SVM is to find the decision boundary with maximum margin separating the two classes.

• The optimum K can be learned by minimizing the function that optimizes the dual (thereby maximizing the margin) with respect to the kernel weights.


• Here e is an n-vector of ones, and a constraint forces the kernel weights to sum to one. C is a hyperparameter that limits violations of the margin.


• The solution returns a linear decision function that defines the distance of a new song sz from the hyperplane boundary between the positive and negative classes (i.e., the relevance of sz to tag t).

• b: offset of the decision boundary from the origin.
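
For a kernel SVM, the decision value of a new song is usually written as a weighted sum of kernel evaluations against the training songs, plus the offset b. The sketch below shows that standard form; the dual coefficients `alpha`, the helper name, and the toy numbers are assumptions, not values from the experiments.

```python
import numpy as np

def decision_value(k_z, alpha, y, b):
    """Signed score of a new song: sum_i alpha_i * y_i * K(i, z) + b.

    k_z: combined-kernel values between each training song and the new song.
    alpha: dual coefficients; y: class labels in {-1, +1}; b: offset.
    """
    alpha = np.asarray(alpha, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(alpha * y, np.asarray(k_z, dtype=float)) + b)

# Toy usage: the new song is closer to the positive training song.
value = decision_value(k_z=[1.0, 0.5], alpha=[0.5, 0.5], y=[1, -1], b=0.1)
```

Ranking songs by this value for a given tag yields the retrieval ordering for that tag.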

Semantic Music Retrieval Experiments

• 500 songs by 500 unique artists, each annotated by a minimum of 3 individuals using a 174-tag vocabulary.

• A song is annotated with a tag when 80% of annotators agree on the tag's relevance.

• Experiment: 72 tags, each associated with at least 20 songs.

Thanks!
