31
Music Classification Using Neural Networks Leo Y. Liu, Lu Wang, Yang Yu UNC STOR

Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Music Classification Using Neural Networks

Leo Y. Liu, Lu Wang, Yang Yu

UNC STOR

Page 2: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

GTZAN Dataset

I 1000 audio tracks each about 30 seconds long

I 10 types: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop,Reggae and Rock

I 22050Hz Mono 16-bit audio files in .wav format

https://drive.google.com/open?id=0BzPvXAjSgVbXLUxsSWc0c2k1MXM.

Page 3: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

GTZAN Dataset

5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco.

Figure 1: Correlation Plot

Page 4: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

GTZAN Dataset

Figure 2: Periodogram

Page 5: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

GTZAN Dataset

Figure 3: Gabor Transformation

Page 6: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Model: Neural NetworkA neural network is a two-stage regression or classification model, typicallyrepresented by a network diagram as following.

Zm = σ(b0m + wTmX ),m = 1, . . . ,M,

Tk = b′0k + βTk Z , k = 1, . . . ,K ,

Yk = fk(X ) = gk(T ), k = 1, . . . ,K .

Page 7: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Model: Neural Network

I Activation function σ(v): sigmoid σ(v) = 1/(1 + e−v ).

I Output function gk(T ): For regression, identity function. In K -classclassification, gk(T ) = eTk /

∑Kl=1 e

Tl (softmax).

I Unknowns: bias and weights {b0m,wm;m = 1, . . . ,M} and{b′0k , βk ; k = 1, . . . ,K}. In total, M(p + 1) + K(M + 1) unknowns.

I Measure of fit: the sum-of-squared error

R(θ) =K∑

k=1

N∑i=1

(Y(i)k − fk(X (i)))2.

can be used for both regression and classification; the cross-entropy

R(θ) = −N∑i=1

K∑k=1

Y(i)k log fk(X (i))

for classification and the corresponding classifier is G(x) = argmaxk fk(X ).

Page 8: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Model: Deep Neural Network / Deep Learning

I Deep neural network/deeplearning: a neural networkwith more than one layers;

I Many variations includingrecurrent neural network(RNN), auto-encoder (AE),convolutional neural network(CNN);

I The variations are generallymodification of the layerstructure, activation functionand input-output flow.

Page 9: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Model: Convolutional Neural Network

It is a deep network with special types of hidden layer: convolutional layer,pooling layer, and fully-connected layer (same hidden layer in regular neuralnetworks).

Page 10: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

CNN: Convolutional Layer

I Applying the element-wise product between a convolutional kernel (amatrix) and the corresponding regions in the input matrix. Sum them theproducts up and add a bias term as the input of the next layer.

I Move the kernel along certain direction and with certain stride size.

I Possibly need zero padding.

Page 11: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

CNN: Convolutional Layer Continued...Same weights and bias are used for each of the 3× 3 hidden neurons.

See http://cs231n.github.io/convolutional-networks/ for anautomation illustration.

Page 12: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

CNN: ReLU

Rectified linear unit:

I Activation layer with max(0, x).

I Sparsity and feature selection.

Page 13: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

CNN: Pooling Layer

I Down-sample the input layer;

I Max pooling (most popular), average pooling;

I Applying filtering on local regions.

Page 14: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Application: Regression

When time series data shows nonlinearity, we can use neural network to build aneural network autoregression (NNAR) instead of AR. An NNAR(p,K) is aneural network with Xt−1, . . . ,Xt−p as inputs, K neurons in the hidden layerand Xt as the output. Following is an NNAR(4, 6).

Page 15: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Application: Regression

Page 16: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Application: Classification

I Handwriting recognition

I Music classification

I so on...

Page 17: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

GTZAN Music genres classificationI n = 1000;

I T = 22, 050 ∗ 25 = 551, 250;

I K = 10;

I Using 80% training 20% testing;

I Preprocessing;

I Classification.

Page 18: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Mel-frequency cepstrum coefficients (MFCC)MFCC’s characterize the short-term power spectrum of a sound;

1. Take the Fourier transform of a windowed excerpt of a signal.

2. Map the powers of the spectrum obtained above onto the mel scale,using triangular overlapping kernel weights.

m = 2595 log10

(1 +

f

700

)= 1127 ln

(1 +

f

700

),

3. Take the logs of the powers at each of the mel frequencies.

4. Take the discrete cosine transform of the list of mel log powers, as if itwere a signal.

5. Extract MFCCs as the amplitudes of the resulting spectrum.

Page 19: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

I Apply triangle kernel weight on given frequencies to compute the powerspectrum.

I Bandwidth is equal in mel scale, and different in original scale. (small inlow frequency and large in high frequency).

Page 20: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Note: window width can be either overlapped or non-overlapped, we usedwindow width of 100 ms with stride size of 25 ms.

Page 21: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Advantages:

I Approximates the human auditory system’s response.Demo in http://www.apronus.com/music/flashpiano.htm

I Downsample the raw data by sampling in the a few frequencies(20hz-8000hz).Demo in https://en.wikipedia.org/wiki/Audio_frequency

I Utilize the local information, both in time domain and frequency domain.

Page 22: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Final Model

I Attempted typical CNN, but got disappointing results...

I Low image features in MFCC’s matrix;I Algorithm not converged;I Li et al. (2010) used 2 hours to training a CNN to classify only 3 genres.

I A deep fully connected CNN, implemented in MATLAB, trained in lessthan 2 minutes.

Page 23: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Implemented in Matlab. Only a few lines of codes, and less than five minutesof training.

layers = [imageInputLayer([21 997 1])

fullyConnectedLayer(2000)

fullyConnectedLayer(1000)

fullyConnectedLayer(500)

fullyConnectedLayer(250)

fullyConnectedLayer(n_class)

softmaxLayer

classificationLayer()];

Page 24: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Binary results

blues classical country disco hiphop jazz metal pop reggae rockblues 100.0% 97.5% 72.5% 82.5% 70.0% 87.5% 80.0% 90.0% 72.5% 72.5%

classical 97.5% 100.0% 92.5% 95.0% 100.0% 87.5% 100.0% 100.0% 97.5% 97.5%country 72.5% 92.5% 100.0% 82.5% 77.5% 77.5% 95.0% 85.0% 85.0% 65.0%disco 82.5% 95.0% 82.5% 100.0% 70.0% 95.0% 92.5% 80.0% 72.5% 70.0%hiphop 70.0% 100.0% 77.5% 70.0% 100.0% 82.5% 90.0% 82.5% 72.5% 65.0%jazz 87.5% 87.5% 77.5% 95.0% 82.5% 100.0% 97.5% 97.5% 80.0% 87.5%metal 80.0% 100.0% 95.0% 92.5% 90.0% 97.5% 100.0% 92.5% 100.0% 92.5%pop 90.0% 100.0% 85.0% 80.0% 82.5% 97.5% 92.5% 100.0% 90.0% 92.5%

reggae 72.5% 97.5% 85.0% 72.5% 72.5% 80.0% 100.0% 90.0% 100.0% 77.5%rock 72.5% 97.5% 65.0% 70.0% 65.0% 87.5% 92.5% 92.5% 77.5% 100.0%

I Overall above 80%.

I Lower accuracies: country vs. blues (72.5%), hiphop vs. blues (70%),hiphop vs. disco (70%), rock vs. blues (72.5%), country vs. rock (65%),and hiphop vs. rock (65%).

I classical, metal and pop are the three most distinguishable genres;

I blues, country and rock are the three least distinguishable genres.

Page 25: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Multi-category results

classical metal pop

Target Class

classical

metal

pop

Ou

tpu

t C

lass

Confusion Matrix

1931.7%

11.7%

00.0%

95.0%5.0%

00.0%

2033.3%

00.0%

100%0.0%

11.7%

23.3%

1728.3%

85.0%15.0%

95.0%5.0%

87.0%13.0%

100%0.0%

93.3%6.7%

Page 26: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

classical jazz metal pop

Target Class

classical

jazz

metal

pop

Ou

tpu

t C

lass

Confusion Matrix

1822.5%

22.5%

00.0%

00.0%

90.0%10.0%

11.3%

1620.0%

11.3%

22.5%

80.0%20.0%

00.0%

00.0%

2025.0%

00.0%

100%0.0%

11.3%

00.0%

22.5%

1721.3%

85.0%15.0%

90.0%10.0%

88.9%11.1%

87.0%13.0%

89.5%10.5%

88.8%11.3%

Page 27: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure
Page 28: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

blues classical disco metal pop

Target Class

blues

classical

disco

metal

pop

Ou

tpu

t C

lass

Confusion Matrix

1313.0%

33.0%

11.0%

22.0%

11.0%

65.0%35.0%

00.0%

1919.0%

00.0%

11.0%

00.0%

95.0%5.0%

33.0%

00.0%

1111.0%

11.0%

55.0%

55.0%45.0%

11.0%

00.0%

22.0%

1717.0%

00.0%

85.0%15.0%

22.0%

22.0%

44.0%

22.0%

1010.0%

50.0%50.0%

68.4%31.6%

79.2%20.8%

61.1%38.9%

73.9%26.1%

62.5%37.5%

70.0%30.0%

Page 29: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

blues classical country disco hiphop jazz metal pop reggae rock

Target Class

blues

classical

country

disco

hiphop

jazz

metal

pop

reggae

rock

Ou

tpu

t C

lass

Confusion Matrix

84.0%

00.0%

52.5%

21.0%

00.0%

00.0%

42.0%

00.0%

00.0%

10.5%

40.0%60.0%

00.0%

178.5%

10.5%

00.0%

00.0%

10.5%

00.0%

00.0%

00.0%

10.5%

85.0%15.0%

73.5%

10.5%

73.5%

10.5%

00.0%

10.5%

00.0%

10.5%

10.5%

10.5%

35.0%65.0%

00.0%

00.0%

10.5%

105.0%

31.5%

00.0%

21.0%

31.5%

00.0%

10.5%

50.0%50.0%

10.5%

10.5%

31.5%

31.5%

42.0%

00.0%

63.0%

10.5%

10.5%

00.0%

20.0%80.0%

00.0%

10.5%

10.5%

31.5%

00.0%

84.0%

10.5%

00.0%

42.0%

21.0%

40.0%60.0%

00.0%

00.0%

00.0%

10.5%

00.0%

00.0%

189.0%

10.5%

00.0%

00.0%

90.0%10.0%

10.5%

00.0%

10.5%

00.0%

00.0%

00.0%

21.0%

168.0%

00.0%

00.0%

80.0%20.0%

31.5%

10.5%

21.0%

21.0%

31.5%

00.0%

10.5%

00.0%

52.5%

31.5%

25.0%75.0%

31.5%

00.0%

00.0%

52.5%

10.5%

00.0%

52.5%

21.0%

21.0%

21.0%

10.0%90.0%

34.8%65.2%

81.0%19.0%

33.3%66.7%

37.0%63.0%

36.4%63.6%

80.0%20.0%

46.2%53.8%

66.7%33.3%

38.5%61.5%

18.2%81.8%

47.5%52.5%

Page 30: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Conclusion

I the deep neural network yields competitive classification accuracy.

I advantage: the prediction power;

I disadvantage: the interpretability;

I the MFCC’s capture the key features in the musical audio signals.

Page 31: Music Classification Using Neural Networks › timeseries › Nonlinear_4_NNs.pdfGTZAN Dataset 5 songs are picked form each of the 4 genres: Blues, Classical, Country, Disco. Figure

Thank You!