
Math 267 // Professor M. Bremer
Class Project

Automatically Generated Music Playlist
Austin Powell, Abhirupa Sen

Abstract
The goal of our paper is to create an algorithm that generates a short music playlist based on at least one song. Our algorithm is an interpretation of the paper Learning a Gaussian Process Prior for Automatically Generating Music Playlists [3]. The foundation of our algorithm is Gaussian process regression with a similarity kernel function between songs. Our model differs from the one published in that paper, largely because our data are mostly continuous, unlike the purely categorical data used in the paper. The final goal, however, is the same: to generate a short playlist sorted by the predicted user preference for the songs.

Contents

Introduction 1
1 Data Characteristics 1
1.1 Attribute Variables 2
2 Model Selection and Interpretation 2
2.1 Kernel Meta Training 2
2.2 Kernel Matrix 3
2.3 Generation of the Playlist 3
2.4 Choice of h, σ_n^2 and σ_f^2 3
3 Results 4
4 Summary and Concluding Remarks 4
References 4
5 Appendix 6

Introduction
Orientation Before digital music's popularity, playlist generation was a manual and perhaps tedious process. Professionally, a playlist was created, for use on a radio station for example, by selecting songs that were popular; individually, a user might create a playlist of songs that matched their particular mood, energy level, or tempo. In either case, the process of manually creating a playlist was (and still is) highly subjective, and the playlist is built from songs the creator already knows.

There are numerous commercial websites for generating a music playlist. These playlists are generated from criteria selected by the user: a mood, a favorite artist, a style, or a song. Although the general idea behind all these websites is similar, few if any have actually published their algorithm.

Our project is based on one such published algorithm, called AutoDJ. AutoDJ is a system for “automatically generating playlists at the time that a user wants to listen to music.” [3] The original system is expected to learn, from as little as one to three songs, what mood, energy level, or tempo a user would want in their playlist. Although our process deviates on several key points (primarily due to the nature of our data), our overall goal is the same as AutoDJ's.

Key Aspects The key aspect of our algorithm is Kernel Meta-Training (KMT). Since the algorithm is supposed to interpret the user's preference on the basis of very few seed songs, the underlying Gaussian process uses a kernel that is trained on a sample of songs. The kernel learns the similarity between songs with respect to the different attributes assigned to each song. The set of attributes for a song constitutes its metadata.

This is one of the key areas where our project differs from AutoDJ's. As we did not have access to the original data set used to meta-train AutoDJ, we used a data set from the Million Song Dataset [1]. Our data differ from the prediction variables in the paper: theirs are exclusively categorical, while ours are a mix of categorical and continuous. Furthermore, in the paper, songs on the playlist after the initial seed song were assumed to have a value of 1 (the user liking a particular song), while our response could be any value above zero: the number of times the user was predicted to want to listen to a particular song.

Plan We will further address the nature of our data, where it came from, and its potential for further development. In our model section, we use general Gaussian process regression, preserve the spirit of the original paper's model, and generate a reasonable playlist.

1. Data Characteristics
Source of Data The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Although different formats of the data are available, in its rawest form the Million Song Dataset is an enormous database (280GB) with entries on the order of hundreds of millions. As a “subset” we downloaded a 3GB file with approximately 49,000 songs. Using a
Automatically Generated Music Playlist — 2/6

short Perl script, a 1% subset of this text file was selected, which contained the API codes to retrieve attribute information from the Echo Nest [4], where the specific music data for the Million Song Dataset are stored. Our final data set was a randomly selected set of 1,000 entries that included 10 variables (summary information is provided in Table 1).

1.1 Attribute Variables
The artist names of the songs are character variables. It is natural that songs sung by the same singer (or belonging to the same album) are closer to each other. However, the mood and other characteristics of the songs may vary widely even when sung by the same singer.

Investigation of Influential Attribute Variables It should be noted that the authors of the original paper most likely produced much better playlists, because their training sets were trained on more variables and on “professionally created playlists”. Because our variables differ so much from the original data set, and because of the somewhat unclear naming rules used by the Echo Nest, principal component and factor analysis were performed on the attribute variables after we had determined numerically that there was indeed correlation between, for example, tempo and loudness.

Our analysis revealed that most of the variability in these attribute variables was due to what could be termed a weighted contrast of tempo and loudness against energy, danceability, speechiness, and instrumentalness (see Appendix Fig. 4). We felt this could be described more comprehensively (based on the 1,000 songs in our list) as a contrast of “harsh” songs with “good feeling” songs. There were also signs of contrast between energy and danceability, and between speechiness and instrumentalness (see Appendix Fig. 7). We felt this could describe a split between users with a preference for genre and users who definitely wanted (or did not want) to hear songs with words.

One further point to note is the obvious outliers for speechiness and instrumentalness seen in Fig. 1 (further evidence of outliers appears in Fig. 7). Playlists generated from these values will naturally be worse at predicting a spread of songs broad enough for the user.

Energy: A song with energy 0 is low in energy; a song with energy close to 1 is high in energy.
Danceability: A song with danceability close to 1 is good to dance to.
Instrumentalness: Some songs with high instrumentalness are purely instrumental.
Speechiness: Measures the amount of spoken words in the song. A song with low speechiness lacks words.
Tempo: Positive numbers that can be as high as 215. A high tempo score means a song of high tempo.
Loudness: Takes both positive and negative values. Songs with larger negative loudness values are softer than those with smaller negative or positive values. The loudness scale used by the Echo Nest [4] is a subjective measure of the intensity of a sound, not necessarily measured in traditional decibels.

Figure 1. Density plots (density vs. value) of energy, danceability, instrumentalness, and speechiness, illustrating the “good feeling” attribute contrast in user preference (whether or not a user had a preference for genre).

Variable          Min     1st Qt  Mean    3rd Qt  Max
Energy            0.009   0.474   0.646   0.864   0.997
Danceability      0.067   0.415   0.528   0.638   0.942
Instrumentalness  0.000   0.000   0.151   0.068   0.974
Speechiness       0.024   0.034   0.079   0.075   0.954
Tempo             46.2    99.8    122.8   140.8   215.8
Loudness          -36.1   -10.2   -8.3    -5.2    2.1
Playcount         1       1       2.787   2       74

Table 1. Summary of the 6 attribute variables used in the Mahalanobis distance matrix, along with the response variable playcount.

2. Model Selection and Interpretation

The response variable is the number of times a user would like to hear a particular song, namely the play count of the song. There was no prior knowledge about the possible shape or distribution of the response variable. In such a situation, the best available option was to fit a non-parametric model. Although the GPM is “mostly” a non-parametric model, the intrinsic noise (for play counts) is assumed normally distributed, ε ∼ N(0, σ_n^2). No assumptions are made about the shape of the function f other than its continuity and smoothness [2].

2.1 Kernel Meta Training
Like any other GP, we make no assumptions about the mean function µ(x); hence our mean function is µ(x) = 0. Our metadata are a combination of 1 categorical and 6 continuous variables. The one categorical variable is the artist name of the song. The continuous variables are energy, danceability, instrumentalness, speechiness, tempo, and loudness. Since we have two types of variables, our final kernel is a combination of two distance matrices.


Mahalanobis Distance Matrix The distance between songs in terms of the continuous attributes is obtained using the Mahalanobis distance. If x_i is the vector of six continuous variables for song i and x_j is the same for song j, then

    D_M(x_i, x_j) = sqrt( (x_i - x_j)' Σ^{-1} (x_i - x_j) )

is the Mahalanobis distance between the two songs in terms of their continuous variables. Each x vector is of length 6, and Σ is the 6x6 covariance matrix of the 6 variables. D_M(x_i, x_j) quantifies the distance between songs i and j: the higher the distance, the less similar the songs. During the kernel meta-training phase, a 1000x1000 matrix of the mutual distances between songs is calculated. The matrix has a zero diagonal, since that is the distance of any song from itself:

    [ d_M(x_1, x_1)     ...  d_M(x_1, x_1000)    ]
    [ ...               ...  ...                 ]
    [ d_M(x_1000, x_1)  ...  d_M(x_1000, x_1000) ]
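As a minimal sketch of this step (assuming the six continuous attributes are held in an n x 6 NumPy array; the function name is ours, not from the paper), the distance matrix could be computed as:

```python
import numpy as np

def mahalanobis_matrix(X):
    """Pairwise Mahalanobis distances between the rows of X.

    X is an (n, p) array of continuous attributes (here n = 1000 songs,
    p = 6 variables); Sigma is the sample covariance of the columns.
    """
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X[:, None, :] - X[None, :, :]            # (n, n, p) pairwise differences
    # Quadratic form (x_i - x_j)' Sigma^{-1} (x_i - x_j) for every pair (i, j)
    sq = np.einsum('ijk,kl,ijl->ij', diff, Sigma_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))             # clip tiny negatives from rounding
```

For n = 1000 the (n, n, p) intermediate is small; for a much larger library the computation would need to be chunked.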

Artist Similarity Matrix The distance between songs in terms of the artist/album name is measured by another 1000x1000 matrix. Songs belonging to the same artist/album are closer to each other than songs by different artists, so this matrix is a binary matrix of 0s and 1s: a 0 in the (i, j)th cell means songs i and j belong to the same artist/album, and a 1 means they do not. The diagonal entries are all 0:

    [ 0                      d_artist(x_1, x_2)  ...  d_artist(x_1, x_1000)   ]
    [ d_artist(x_2, x_1)     0                   ...  ...                     ]
    [ ...                    ...                 ...  d_artist(x_999, x_1000) ]
    [ d_artist(x_1000, x_1)  ...                 ...  0                       ]

The distance between a pair of songs is then obtained by adding the two 1000x1000 matrices:

    D(x_i, x_j) = D_artist(x_i, x_j) + D_M(x_i, x_j)
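A sketch of the binary artist distance and the combined distance (function and variable names are ours):

```python
import numpy as np

def artist_distance_matrix(artists):
    """0 in cell (i, j) if songs i and j share an artist/album, 1 otherwise."""
    a = np.asarray(artists)
    return (a[:, None] != a[None, :]).astype(float)

def combined_distance(D_artist, D_mahalanobis):
    """Total song-to-song distance: the elementwise sum of the two n x n matrices."""
    return D_artist + D_mahalanobis
```

The diagonal is automatically 0 in both matrices, so the combined distance of a song to itself is 0 as required.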

2.2 Kernel Matrix
Having constructed the distance matrix, the 1000x1000 kernel matrix is obtained from it as a measure of the covariance between two songs. Covariance is a decreasing function of the distance between the songs, so the kernel attains its maximum on the diagonal. For any two songs i and j, the kernel is

    K(x_i, x_j) = σ_f^2 exp{ -d^2 / (2h^2) },

where d is the distance obtained by adding the two matrices generated in the previous steps, σ_f^2 is the maximum covariance we allow the function to have, and h is the bandwidth we use when measuring the similarity between two songs.
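Under the same assumptions (names ours; the default parameter values are the ones chosen in Section 2.4), this squared-exponential kernel is a one-line elementwise operation:

```python
import numpy as np

def kernel_matrix(D, sigma_f2=0.75, h=1.5):
    """K(x_i, x_j) = sigma_f^2 * exp(-d^2 / (2 h^2)), applied elementwise
    to the combined n x n distance matrix D."""
    return sigma_f2 * np.exp(-(D ** 2) / (2.0 * h ** 2))
```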

2.3 Generation of the Playlist
Whenever the user selects one or more seed songs from the library, our Auto-DJ finds the twenty songs closest to the seeds. The rows of the previously generated kernel matrix corresponding to these songs hold their covariances with the other songs and form the matrix Σ(x*, x), where x* denotes the songs nearest to our seed. The expected number of times the user would listen to each song in the playlist is

    f̄* = Σ(x*, x) [ Σ(x, x) + σ_n^2 I ]^{-1} y,

where Σ(x_i, x_j) = K(x_i, x_j) and y is the vector of observed play counts of the songs.
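The posterior-mean formula above can be sketched directly (index handling and names are ours; train_idx corresponds to the songs x with observed play counts y, candidate_idx to the songs x* being scored):

```python
import numpy as np

def predict_playcounts(K, train_idx, candidate_idx, y, sigma_n2=0.5):
    """GP posterior mean f* = K(x*, x) [K(x, x) + sigma_n^2 I]^{-1} y."""
    K_star = K[np.ix_(candidate_idx, train_idx)]     # Sigma(x*, x)
    K_train = K[np.ix_(train_idx, train_idx)]        # Sigma(x, x)
    A = K_train + sigma_n2 * np.eye(len(train_idx))
    # Solve A z = y rather than forming the inverse explicitly
    return K_star @ np.linalg.solve(A, np.asarray(y, dtype=float)[train_idx])
```

Sorting the candidate songs by this predicted play count then yields the playlist.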

Our subsample training data consist of 1,000 songs. 752 songs have play counts less than or equal to 2. Of the remaining songs, 140 have play counts less than or equal to 5; only 108 songs have play counts above 5, and only 43 have play counts above 10. With play counts of 1 and 2 carrying 75% of the total weight, it was clear that they would pull down the prediction values for some songs with very high play counts. As a result, we never had very high predictions for songs with real play counts as high as 14 or 33. Intuitively, when one of these songs (with play count 14) was chosen as the seed song, the resulting playlist had relatively high play counts for the songs in the list. Since the songs chosen for the playlist are supposed to be in the neighborhood of this song (with play count 14) in terms of its training attributes, their expected play counts are pulled up by that of the neighbor. However, when the seed with a high play count appears in the playlist, its expected play count is not as high as its real play count. The kernel is trained on the song's artist and other continuous attributes, and it is not necessary that all songs with similar attributes have high play counts; hence a song with a high play count is likely to be surrounded by songs with much lower play counts. This explains the consistently low predictions we get from this model. The predictions for songs with real play counts below 4 are much better than those for songs with play counts above 10. This is a limitation of the way our data are distributed. We tried to get better estimates by adjusting the values of h, σ_n^2 and σ_f^2.

2.4 Choice of h, σ_n^2 and σ_f^2
It is important to note that the true values of these parameters are unknown. The values given below were chosen by hand to produce playlists from our 1,000 songs that “made the most sense” (rejected plots are shown in the Appendix).

h (value chosen: 1.5) The higher the h value, the smoother the curve. Increasing h helped improve the scores of the songs with higher play counts, since the function gives more weight even to points a little farther off. This, however, pulls up the play counts of songs that were originally low. Overall, the scores looked best with h as high as 1.5.

σ_f^2 (value chosen: 0.75) Our covariance factor. A smaller σ_f value has the same effect as having a high bandwidth.


σ_n^2 (value chosen: 0.5) Our choice of intrinsic noise, ε ∼ N(0, σ_n^2), is 0.5. Making this value low makes the fitted function more wiggly, since the function is allowed less noise variance and gets pulled toward all the observations. This combination of h, σ_f and σ_n seemed to give the best estimates of the play counts.

Figure 2. Final model: actual playcounts (filled squares) and predicted playcounts (triangles) against danceability, with a LOESS regression through the predictions. h = 1.5, σ_f^2 = 0.75, σ_n^2 = 0.75.

3. Results
In Table 2 we give the results from our danceability playlist. All of our playlists were generated from the highest playcounts, since we expect our user to like popular songs. We agreed the songs generated might be called a “lounge slow dance” list. Of course, the genre of some of these songs would likely not fit into the playlist very well, but given the relative smallness of our training set (1,000 entries), we felt it was a good result.

Seed song: Bringin it. Artist: Nightmares on Wax

4. Summary and Concluding Remarks
We have derived an algorithm from the published paper Learning a Gaussian Process Prior for Automatically Generating Music Playlists [3], in which we created a kernel from a set of functions related to the function being learned. From this we generated a playlist from very little input from a fake

Figure 3. Example plot with non-optimal parameters: actual playcounts (filled squares) and predicted playcounts (triangles) against danceability, with a LOESS regression through the predictions. h = 1, σ_f^2 = 1, σ_n^2 = 1.

Seed song: Bringin it. Seed artist: Nightmares on Wax.

Song                      Predicted  Actual
Rollerblades                   5.34       3
Long Arm of The Law            4.34       9
Family Business                4.26       2
X-Unknown                      4.21       1
Wolves (Intro)                 3.57       3
Paa vauhtii                    3.32       1
Thugged-Out                    3.13       1
Stab (Album Version)           3.12       2
Peace Sign / Index Down        2.99       1
Dope Boy Magic                 2.95       2

Table 2. First 10 songs from the playlist generated from the danceability seed song.

“user”, created from the songs among the original 1,000 entries with the highest values of our 6 attributes. Given that our playlist was trained on quite a small data set, we produced songs whose qualities were quite similar to each other, but with enough variability to be interesting to a user.

References
[1] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.


[2] Martina Bremer. Kernel regression. In Lecture 4, Gaussian Process Models, March 2015.

[3] John C. Platt, Christopher J.C. Burges, Steven Swenson, Christopher Weare, and Alice Zheng. Learning a Gaussian process prior for automatically generating music playlists. Microsoft Research, December 2001.

[4] Million Song Project. The Echo Nest: music database platform. http://the.echonest.com/. Accessed: 2010-09-30.


5. Appendix

Figure 4. Boxplots for the 6 attribute variables (energy, danceability, instrumentalness, speechiness, tempo, loudness), demonstrating the contrasts from the factor analysis.

Figure 5. Boxplots for the “good feeling” attribute contrast (energy, danceability, instrumentalness, speechiness).

Figure 6. Alternative parameter choices (predicted = red, empirical = black): playcounts vs. danceability with h = 1, σ_n = 0.75, σ_f = 0.75.

Figure 7. Alternative parameter choices (predicted = red, empirical = black): playcounts vs. danceability with h = 1.5, σ_n = 1, σ_f = 1.