Music Classification Using Neural Networks
Leo Y. Liu, Lu Wang, Yang Yu
UNC STOR
GTZAN Dataset
- 1000 audio tracks, each about 30 seconds long
- 10 genres: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae and Rock
- 22,050 Hz mono 16-bit audio files in .wav format
- https://drive.google.com/open?id=0BzPvXAjSgVbXLUxsSWc0c2k1MXM
GTZAN Dataset
5 songs are picked from each of the 4 genres: Blues, Classical, Country, Disco.
Figure 1: Correlation Plot
GTZAN Dataset
Figure 2: Periodogram
GTZAN Dataset
Figure 3: Gabor Transformation
Model: Neural Network
A neural network is a two-stage regression or classification model, typically represented by a network diagram as follows.
Z_m = \sigma(b_{0m} + w_m^T X), \quad m = 1, \ldots, M,
T_k = b'_{0k} + \beta_k^T Z, \quad k = 1, \ldots, K,
Y_k = f_k(X) = g_k(T), \quad k = 1, \ldots, K.
Model: Neural Network
- Activation function \sigma(v): sigmoid, \sigma(v) = 1/(1 + e^{-v}).
- Output function g_k(T): for regression, the identity function; in K-class classification, g_k(T) = e^{T_k} / \sum_{l=1}^{K} e^{T_l} (softmax).
- Unknowns: biases and weights \{b_{0m}, w_m; m = 1, \ldots, M\} and \{b'_{0k}, \beta_k; k = 1, \ldots, K\}. In total, M(p + 1) + K(M + 1) unknowns.
- Measure of fit: the sum-of-squared error

R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left( Y_k^{(i)} - f_k(X^{(i)}) \right)^2

can be used for both regression and classification; the cross-entropy

R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} Y_k^{(i)} \log f_k(X^{(i)})

is used for classification, and the corresponding classifier is G(x) = \arg\max_k f_k(x).
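As an illustrative sketch (not the code used in this project), the forward pass and both fit criteria above can be written in a few lines of NumPy; the weights here are random, and `c0`/`B` stand in for b'_{0k} and \beta_k:

```python
import numpy as np

rng = np.random.default_rng(0)
p, M, K, N = 4, 6, 3, 5                            # inputs, hidden units, classes, samples

# biases and weights {b_0m, w_m} and {b'_0k, beta_k}
b0, W = rng.normal(size=M), rng.normal(size=(M, p))
c0, B = rng.normal(size=K), rng.normal(size=(K, M))

def forward(X):
    """One hidden layer: Z = sigma(b0 + W X), T = c0 + B Z, Y = softmax(T)."""
    Z = 1.0 / (1.0 + np.exp(-(X @ W.T + b0)))      # sigma(v) = 1/(1 + e^-v)
    T = Z @ B.T + c0
    e = np.exp(T - T.max(axis=1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

X = rng.normal(size=(N, p))
Y = np.eye(K)[rng.integers(0, K, size=N)]          # one-hot targets Y^(i)
F = forward(X)

sse = np.sum((Y - F) ** 2)                         # sum-of-squared error R(theta)
xent = -np.sum(Y * np.log(F))                      # cross-entropy R(theta)
G = F.argmax(axis=1)                               # classifier G(x) = argmax_k f_k(x)
```

Training would then minimize R(\theta) over the M(p + 1) + K(M + 1) unknowns, typically by back-propagation.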
Model: Deep Neural Network / Deep Learning
- Deep neural network / deep learning: a neural network with more than one hidden layer;
- Many variations, including the recurrent neural network (RNN), the auto-encoder (AE), and the convolutional neural network (CNN);
- The variations are generally modifications of the layer structure, the activation function, and the input-output flow.
Model: Convolutional Neural Network
It is a deep network with special types of hidden layers: the convolutional layer, the pooling layer, and the fully-connected layer (the same hidden layer as in regular neural networks).
CNN: Convolutional Layer
- Apply the element-wise product between a convolutional kernel (a matrix) and the corresponding region in the input matrix, sum the products up, and add a bias term; the result is the input of the next layer.
- Move the kernel along a certain direction with a certain stride size.
- Zero padding may be needed.
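The steps above can be sketched in NumPy; this is a minimal illustration with a hypothetical 2 × 2 kernel, not code from the project:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0, stride=1, pad=0):
    """Element-wise product of the kernel with each local region of x,
    summed, plus a bias; the kernel moves by `stride`, with zero padding."""
    x = np.pad(x, pad)
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(region * kernel) + bias
    return out

x = np.arange(16.0).reshape(4, 4)            # toy 4x4 input
k = np.array([[1.0, 0.0], [0.0, -1.0]])      # toy 2x2 kernel
out = conv2d(x, k)                           # 3x3 output; every entry is -5.0
```

With `pad=1` the output grows to 5 × 5, and with `stride=2` it shrinks to 2 × 2, matching the usual CNN bookkeeping.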
CNN: Convolutional Layer Continued...
The same weights and bias are used for each of the 3 × 3 hidden neurons.
See http://cs231n.github.io/convolutional-networks/ for an animated illustration.
CNN: ReLU
Rectified linear unit:
- Activation layer with max(0, x).
- Sparsity and feature selection.
CNN: Pooling Layer
- Down-samples the input layer;
- Max pooling (most popular) and average pooling;
- Applies a filter over local regions.
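A minimal NumPy sketch of ReLU followed by max pooling (illustrative values, not project code):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def max_pool(x, size=2, stride=2):
    """Down-sample by taking the max over each size x size local region."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., -2., 3., 0.],
              [4., -5., -6., 7.],
              [8., 9., -1., -3.],
              [0., 2., 5., 6.]])
pooled = max_pool(relu(x))     # [[4., 7.], [9., 6.]]
```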
Application: Regression
When time series data show nonlinearity, we can use a neural network to build a neural network autoregression (NNAR) instead of an AR model. An NNAR(p, K) is a neural network with X_{t-1}, \ldots, X_{t-p} as inputs, K neurons in the hidden layer, and X_t as the output. The following is an NNAR(4, 6).
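How the lagged inputs of an NNAR(p, K) are assembled can be sketched as follows (`nnar_design` is an illustrative helper, not part of the project code):

```python
import numpy as np

def nnar_design(x, p):
    """Build NNAR(p, K) training pairs: inputs (x_{t-1}, ..., x_{t-p}),
    output x_t, for each t with a full window of p lags available."""
    X = np.column_stack([x[p - lag : len(x) - lag] for lag in range(1, p + 1)])
    y = x[p:]
    return X, y

x = np.arange(10.0)            # a toy series
X, y = nnar_design(x, p=4)     # as in NNAR(4, 6): 4 lagged inputs per target
# first row: inputs (x_3, x_2, x_1, x_0), target x_4
```

The resulting (X, y) pairs are then fed to an ordinary single-hidden-layer network with K neurons.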
Application: Regression
Application: Classification
- Handwriting recognition
- Music classification
- and so on...
GTZAN Music Genre Classification
- n = 1000;
- T = 22,050 × 25 = 551,250;
- K = 10;
- 80% for training, 20% for testing;
- Preprocessing;
- Classification.
Mel-frequency cepstrum coefficients (MFCC)
MFCCs characterize the short-term power spectrum of a sound:
1. Take the Fourier transform of a windowed excerpt of a signal.
2. Map the powers of the spectrum obtained above onto the mel scale,using triangular overlapping kernel weights.
m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) = 1127 \ln\left(1 + \frac{f}{700}\right),
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if itwere a signal.
5. Extract MFCCs as the amplitudes of the resulting spectrum.
- Apply triangle kernel weights at the given frequencies to compute the power spectrum.
- The bandwidth is equal on the mel scale but varies on the original scale (small at low frequencies and large at high frequencies).
Note: windows can be either overlapping or non-overlapping; we used a window width of 100 ms with a stride size of 25 ms.
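The mel mapping and its inverse follow directly from the formula above (`hz_to_mel` and `mel_to_hz` are illustrative helper names):

```python
import math

def hz_to_mel(f):
    """Map frequency in Hz to the mel scale: m = 2595 log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Band edges equally spaced in mel are narrow at low frequencies
# and wide at high frequencies when mapped back to Hz.
edges = [mel_to_hz(m) for m in [0, 500, 1000, 1500, 2000]]
```

This is why the triangular filters have equal bandwidth on the mel scale but unequal bandwidth on the original frequency scale.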
Advantages:
- Approximates the human auditory system's response. Demo at http://www.apronus.com/music/flashpiano.htm
- Downsamples the raw data by sampling at a few frequencies (20 Hz-8000 Hz). Demo at https://en.wikipedia.org/wiki/Audio_frequency
- Utilizes local information in both the time domain and the frequency domain.
Final Model
- We attempted a typical CNN, but got disappointing results:
  - the MFCC matrix has few image-like features;
  - the algorithm did not converge;
  - Li et al. (2010) took 2 hours to train a CNN that classifies only 3 genres.
- Our final model is a deep fully connected network, implemented in MATLAB in only a few lines of code and trained in less than 2 minutes.
layers = [imageInputLayer([21 997 1])   % input: 21 x 997 MFCC matrix per track
    fullyConnectedLayer(2000)           % fully connected layers of decreasing width
    fullyConnectedLayer(1000)
    fullyConnectedLayer(500)
    fullyConnectedLayer(250)
    fullyConnectedLayer(n_class)        % n_class = 10 genres
    softmaxLayer                        % class probabilities
    classificationLayer()];             % cross-entropy loss
Binary results
           blues classical country disco hiphop  jazz  metal   pop  reggae  rock
blues     100.0%  97.5%   72.5%  82.5%  70.0%  87.5%  80.0%  90.0%  72.5%  72.5%
classical  97.5% 100.0%   92.5%  95.0% 100.0%  87.5% 100.0% 100.0%  97.5%  97.5%
country    72.5%  92.5%  100.0%  82.5%  77.5%  77.5%  95.0%  85.0%  85.0%  65.0%
disco      82.5%  95.0%   82.5% 100.0%  70.0%  95.0%  92.5%  80.0%  72.5%  70.0%
hiphop     70.0% 100.0%   77.5%  70.0% 100.0%  82.5%  90.0%  82.5%  72.5%  65.0%
jazz       87.5%  87.5%   77.5%  95.0%  82.5% 100.0%  97.5%  97.5%  80.0%  87.5%
metal      80.0% 100.0%   95.0%  92.5%  90.0%  97.5% 100.0%  92.5% 100.0%  92.5%
pop        90.0% 100.0%   85.0%  80.0%  82.5%  97.5%  92.5% 100.0%  90.0%  92.5%
reggae     72.5%  97.5%   85.0%  72.5%  72.5%  80.0% 100.0%  90.0% 100.0%  77.5%
rock       72.5%  97.5%   65.0%  70.0%  65.0%  87.5%  92.5%  92.5%  77.5% 100.0%
- Overall above 80%.
- Lower accuracies: country vs. blues (72.5%), hiphop vs. blues (70%), hiphop vs. disco (70%), rock vs. blues (72.5%), country vs. rock (65%), and hiphop vs. rock (65%).
- Classical, metal and pop are the three most distinguishable genres;
- Blues, country and rock are the three least distinguishable genres.
Multi-category results
Confusion matrix on 3 genres (rows: output class, columns: target class; 20 tracks per genre):

           classical  metal  pop
classical       19      0     1
metal            1     20     2
pop              0      0    17

Per-genre accuracy: classical 95.0%, metal 100%, pop 85.0%; overall 93.3%.
Confusion matrix on 4 genres (rows: output class, columns: target class; 20 tracks per genre):

           classical  jazz  metal  pop
classical       18      1     0     1
jazz             2     16     0     0
metal            0      1    20     2
pop              0      2     0    17

Per-genre accuracy: classical 90.0%, jazz 80.0%, metal 100%, pop 85.0%; overall 88.8%.
Confusion matrix on 5 genres (rows: output class, columns: target class; 20 tracks per genre):

           blues  classical  disco  metal  pop
blues        13       0        3      1     2
classical     3      19        0      0     2
disco         1       0       11      2     4
metal         2       1        1     17     2
pop           1       0        5      0    10

Per-genre accuracy: blues 65.0%, classical 95.0%, disco 55.0%, metal 85.0%, pop 50.0%; overall 70.0%.
Confusion matrix on all 10 genres (rows: output class, columns: target class; 20 tracks per genre):

           blues classical country disco hiphop jazz metal pop reggae rock
blues        8      0        7      0      1     0     0    1     3     3
classical    0     17        1      0      1     1     0    0     1     0
country      5      1        7      1      3     1     0    1     2     0
disco        2      0        1     10      3     3     1    0     2     5
hiphop       0      0        0      3      4     0     0    0     3     1
jazz         0      1        1      0      0     8     0    0     0     0
metal        4      0        0      2      6     1    18    2     1     5
pop          0      0        1      3      1     0     1   16     0     2
reggae       0      0        1      0      1     4     0    0     5     2
rock         1      1        1      1      0     2     0    0     3     2

Per-genre accuracy: blues 40.0%, classical 85.0%, country 35.0%, disco 50.0%, hiphop 20.0%, jazz 40.0%, metal 90.0%, pop 80.0%, reggae 25.0%, rock 10.0%; overall 47.5%.
Conclusion
- The deep neural network yields competitive classification accuracy.
- Advantage: predictive power;
- Disadvantage: interpretability;
- The MFCCs capture the key features in the musical audio signals.
Thank You!